UK Road Safety: Traffic Accidents and Vehicles

The goal of this project is the investigate what causes Serious and Fatal accidents in hopes of preventing and decreasing the number of them. The dataset consists of accident records from the UK over the course of 15+ years. I hope to show the causes of these accidents through visualizations and create an algorithm that can predict the severity of accidents.

The UK government collects and publishes (usually on an annual basis) detailed information about traffic accidents across the country. This information includes, but is not limited to, geographical locations, weather conditions, type of vehicles, number of casualties and vehicle manoeuvres, making this a very interesting and comprehensive dataset for analysis and research.

The data that I'm using is compiled and available through Kaggle and in a less compliled form, here.

Problem: Traffic Accidents
Solution Method: Use data to figure out how to lower the number of accidents and the severity of them.

Importing and Data Merging

In [9]:
#Import modules
import numpy as np
import holidays
import pandas as pd
import seaborn as sns
import pickle
import time
import timeit


import matplotlib.pyplot as plt
plt.style.use('dark_background')
%matplotlib inline

import datetime
import math
from collections import Counter

#scipy
import scipy.stats as stats
from scipy import stats
from scipy.stats import chi2_contingency

#sklearn
import sklearn
from sklearn import ensemble
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score 
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

#for clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

#other learners
from xgboost import XGBClassifier
import lightgbm as lgb
from kmodes.kmodes import KModes

#imblearn
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

#webscraping
import requests
from bs4 import BeautifulSoup
import re
import urllib
from IPython.core.display import HTML

#time series
import statsmodels.api as sm
from pylab import rcParams
import itertools
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA


#warning ignorer
import warnings
warnings.filterwarnings("ignore")
In [10]:
# # #DATAFRAME PICKLE CREATED IN CELLS BELOW INSTEAD OF RUNNING THROUGH ENTIRE PROCESS AFTER RESTARTING
# # #import pickled file
# df = pd.read_pickle("df.pkl")
# df.to_csv('uktraffic_acc.csv') 
In [10]:
#import files

ac = pd.read_csv(r'Accident_Information.csv', low_memory=False, chunksize=30000)
vc = pd.read_csv(r'Vehicle_Information.csv', low_memory=False, chunksize=30000)

Previously, I did not remove "Data missing or out of range" from the datasets however through cleaning and checking the value counts I decided to do so for sanity purposes only. Most of the percentages that had this as a value were not a high percentage either.

In [3]:
#chunk cleaning and dataframing for accident column
acchunk = []
for chunk in ac:
    acchunk_filter = chunk[
        (chunk.Year.astype(int) >= 2010) &
        (chunk.Year.astype(int) <= 2017) &
        (chunk['Road_Type'] != "Unknown") &
        (chunk['Junction_Control'] != "Data missing or out of range") &
        (chunk['Carriageway_Hazards'] != "Data missing or out of range") &
        (chunk['Junction_Detail'] != "Data missing or out of range") &
        (chunk['Road_Surface_Conditions'] != "Data missing or out of range") &
        (chunk['Special_Conditions_at_Site'] != "Data missing or out of range") &
        (chunk['Weather_Conditions'] != "Data missing or out of range") &
        (chunk['Latitude'].notnull()) &
        (chunk['Longitude'].notnull())
    ]
    acchunk.append(acchunk_filter)
df1 = pd.concat(acchunk)
In [4]:
#chunk cleaning for vehicles column
vcchunk = []
for chunk2 in vc:
    vcchunk_filter = chunk2[
        (chunk2.Year.astype(int) >= 2010)&
        (chunk2.Year.astype(int) <= 2017) &
        (chunk2['Driver_Home_Area_Type'] != "Data missing or out of range") &
        (chunk2['Journey_Purpose_of_Driver'] != "Data missing or out of range") &
        (chunk2['Junction_Location'] != "Data missing or out of range") &
        (chunk2['Was_Vehicle_Left_Hand_Drive'] != "Data missing or out of range") &
        (chunk2['Hit_Object_in_Carriageway'] != "Data missing or out of range") &
        (chunk2['Skidding_and_Overturning'] != "Data missing or out of range") &
        (chunk2['Towing_and_Articulation'] != "Data missing or out of range") &
        (chunk2['Vehicle_Leaving_Carriageway'] != "Data missing or out of range") &
        (chunk2['Vehicle_Manoeuvre'] != "Data missing or out of range") &
        (chunk2['Vehicle_Type'] != "Data missing or out of range") &
        (chunk2['X1st_Point_of_Impact'] != "Data missing or out of range") &
        (chunk2['Sex_of_Driver'] != "Data missing or out of range") &
        (chunk2['Age_Band_of_Driver'] != "Data missing or out of range")
        
    ]
    vcchunk.append(vcchunk_filter)
df2 = pd.concat(vcchunk)
In [5]:
#check columns
print("Accident's Columns:\n",df1.columns, "\n")

print("Vehicle's Columns:\n",df2.columns)
Accident's Columns:
 Index(['Accident_Index', '1st_Road_Class', '1st_Road_Number', '2nd_Road_Class',
       '2nd_Road_Number', 'Accident_Severity', 'Carriageway_Hazards', 'Date',
       'Day_of_Week', 'Did_Police_Officer_Attend_Scene_of_Accident',
       'Junction_Control', 'Junction_Detail', 'Latitude', 'Light_Conditions',
       'Local_Authority_(District)', 'Local_Authority_(Highway)',
       'Location_Easting_OSGR', 'Location_Northing_OSGR', 'Longitude',
       'LSOA_of_Accident_Location', 'Number_of_Casualties',
       'Number_of_Vehicles', 'Pedestrian_Crossing-Human_Control',
       'Pedestrian_Crossing-Physical_Facilities', 'Police_Force',
       'Road_Surface_Conditions', 'Road_Type', 'Special_Conditions_at_Site',
       'Speed_limit', 'Time', 'Urban_or_Rural_Area', 'Weather_Conditions',
       'Year', 'InScotland'],
      dtype='object') 

Vehicle's Columns:
 Index(['Accident_Index', 'Age_Band_of_Driver', 'Age_of_Vehicle',
       'Driver_Home_Area_Type', 'Driver_IMD_Decile', 'Engine_Capacity_.CC.',
       'Hit_Object_in_Carriageway', 'Hit_Object_off_Carriageway',
       'Journey_Purpose_of_Driver', 'Junction_Location', 'make', 'model',
       'Propulsion_Code', 'Sex_of_Driver', 'Skidding_and_Overturning',
       'Towing_and_Articulation', 'Vehicle_Leaving_Carriageway',
       'Vehicle_Location.Restricted_Lane', 'Vehicle_Manoeuvre',
       'Vehicle_Reference', 'Vehicle_Type', 'Was_Vehicle_Left_Hand_Drive',
       'X1st_Point_of_Impact', 'Year'],
      dtype='object')
In [6]:
print('Accident Shape', df1.shape)
print('Vehicle Shape',df2.shape)
Accident Shape (691195, 34)
Vehicle Shape (1167198, 24)
In [7]:
#merge dataframes
df = pd.merge(df1,df2)
In [8]:
#check columns
print("Names of Combined Columns:\n",df.columns, "\n")
print("\nShape:\n",df.shape)
Names of Combined Columns:
 Index(['Accident_Index', '1st_Road_Class', '1st_Road_Number', '2nd_Road_Class',
       '2nd_Road_Number', 'Accident_Severity', 'Carriageway_Hazards', 'Date',
       'Day_of_Week', 'Did_Police_Officer_Attend_Scene_of_Accident',
       'Junction_Control', 'Junction_Detail', 'Latitude', 'Light_Conditions',
       'Local_Authority_(District)', 'Local_Authority_(Highway)',
       'Location_Easting_OSGR', 'Location_Northing_OSGR', 'Longitude',
       'LSOA_of_Accident_Location', 'Number_of_Casualties',
       'Number_of_Vehicles', 'Pedestrian_Crossing-Human_Control',
       'Pedestrian_Crossing-Physical_Facilities', 'Police_Force',
       'Road_Surface_Conditions', 'Road_Type', 'Special_Conditions_at_Site',
       'Speed_limit', 'Time', 'Urban_or_Rural_Area', 'Weather_Conditions',
       'Year', 'InScotland', 'Age_Band_of_Driver', 'Age_of_Vehicle',
       'Driver_Home_Area_Type', 'Driver_IMD_Decile', 'Engine_Capacity_.CC.',
       'Hit_Object_in_Carriageway', 'Hit_Object_off_Carriageway',
       'Journey_Purpose_of_Driver', 'Junction_Location', 'make', 'model',
       'Propulsion_Code', 'Sex_of_Driver', 'Skidding_and_Overturning',
       'Towing_and_Articulation', 'Vehicle_Leaving_Carriageway',
       'Vehicle_Location.Restricted_Lane', 'Vehicle_Manoeuvre',
       'Vehicle_Reference', 'Vehicle_Type', 'Was_Vehicle_Left_Hand_Drive',
       'X1st_Point_of_Impact'],
      dtype='object') 


Shape:
 (720280, 56)
In [9]:
df.describe(include ='all')
Out[9]:
Accident_Index 1st_Road_Class 1st_Road_Number 2nd_Road_Class 2nd_Road_Number Accident_Severity Carriageway_Hazards Date Day_of_Week Did_Police_Officer_Attend_Scene_of_Accident ... Sex_of_Driver Skidding_and_Overturning Towing_and_Articulation Vehicle_Leaving_Carriageway Vehicle_Location.Restricted_Lane Vehicle_Manoeuvre Vehicle_Reference Vehicle_Type Was_Vehicle_Left_Hand_Drive X1st_Point_of_Impact
count 720280 720280 720280.000000 699431 710979.000000 720280 720280 720280 720280 720279.000000 ... 720280 720280 720280 720280 720269.000000 720280 720280.000000 720280 720280 720280
unique 488010 6 NaN 6 NaN 3 6 2557 7 NaN ... 3 6 6 9 NaN 18 NaN 20 2 5
top 2016140142191 A NaN Unclassified NaN Slight None 2015-05-13 Friday NaN ... Male None No tow/articulation Did not leave carriageway NaN Going ahead other NaN Car No Front
freq 15 365390 NaN 473098 NaN 626656 714068 515 119324 NaN ... 484205 659742 713570 663463 NaN 314008 NaN 534189 719140 354438
mean NaN NaN 1051.470728 NaN 607.761499 NaN NaN NaN NaN 1.147353 ... NaN NaN NaN NaN 0.086026 NaN 1.506420 NaN NaN NaN
std NaN NaN 1825.784600 NaN 1593.978070 NaN NaN NaN NaN 0.357791 ... NaN NaN NaN NaN 0.784117 NaN 0.644346 NaN NaN NaN
min NaN NaN 0.000000 NaN 0.000000 NaN NaN NaN NaN 1.000000 ... NaN NaN NaN NaN 0.000000 NaN 1.000000 NaN NaN NaN
25% NaN NaN 0.000000 NaN 0.000000 NaN NaN NaN NaN 1.000000 ... NaN NaN NaN NaN 0.000000 NaN 1.000000 NaN NaN NaN
50% NaN NaN 191.000000 NaN 0.000000 NaN NaN NaN NaN 1.000000 ... NaN NaN NaN NaN 0.000000 NaN 1.000000 NaN NaN NaN
75% NaN NaN 900.000000 NaN 173.000000 NaN NaN NaN NaN 1.000000 ... NaN NaN NaN NaN 0.000000 NaN 2.000000 NaN NaN NaN
max NaN NaN 9999.000000 NaN 9999.000000 NaN NaN NaN NaN 3.000000 ... NaN NaN NaN NaN 9.000000 NaN 91.000000 NaN NaN NaN

11 rows Ă— 56 columns

Data Cleaning

In [10]:
#check corr b/t Location_Easting_OSGR & Location_Northing_OSGR AND Longitude and Latitude

print(df['Location_Easting_OSGR'].corr(df['Longitude']))


print(df['Location_Northing_OSGR'].corr(df['Latitude']))
0.999425701544617
0.9999733124707393
In [11]:
#drop Location_Easting_OSGR & Location_Northing_OSGR
#because they are the similar to Latitude and Longitude 

df = df.drop(['Location_Easting_OSGR', 'Location_Northing_OSGR'], axis=1)
In [12]:
df.shape
Out[12]:
(720280, 54)
In [13]:
#standardize all column names to lowercase, and remove some characters 
#for ease of use in querying
df.columns = map(str.lower, df.columns)
df.columns = df.columns.str.replace('.','')
df.columns = df.columns.str.replace('(','')
df.columns = df.columns.str.replace(')','')
In [14]:
#convert date/time to datetime datatype

df['date'] = pd.to_datetime((df['date']), format= "%Y-%m-%d")
In [15]:
#df.dtypes
In [16]:
#mistyped datatypes

df[['did_police_officer_attend_scene_of_accident',
    'driver_imd_decile','vehicle_reference',
    'vehicle_locationrestricted_lane','1st_road_number',
    '2nd_road_number','driver_imd_decile',
    'pedestrian_crossing-physical_facilities',
   'pedestrian_crossing-human_control']]= df[['did_police_officer_attend_scene_of_accident',
    'driver_imd_decile','vehicle_reference',
    'vehicle_locationrestricted_lane','1st_road_number',
    '2nd_road_number','driver_imd_decile',
    'pedestrian_crossing-physical_facilities',
   'pedestrian_crossing-human_control']].astype('object')
In [17]:
df.columns.to_series().groupby(df.dtypes).groups
Out[17]:
{dtype('<M8[ns]'): Index(['date'], dtype='object'),
 dtype('int64'): Index(['number_of_casualties', 'number_of_vehicles', 'year'], dtype='object'),
 dtype('float64'): Index(['latitude', 'longitude', 'speed_limit', 'age_of_vehicle',
        'engine_capacity_cc'],
       dtype='object'),
 dtype('O'): Index(['accident_index', '1st_road_class', '1st_road_number', '2nd_road_class',
        '2nd_road_number', 'accident_severity', 'carriageway_hazards',
        'day_of_week', 'did_police_officer_attend_scene_of_accident',
        'junction_control', 'junction_detail', 'light_conditions',
        'local_authority_district', 'local_authority_highway',
        'lsoa_of_accident_location', 'pedestrian_crossing-human_control',
        'pedestrian_crossing-physical_facilities', 'police_force',
        'road_surface_conditions', 'road_type', 'special_conditions_at_site',
        'time', 'urban_or_rural_area', 'weather_conditions', 'inscotland',
        'age_band_of_driver', 'driver_home_area_type', 'driver_imd_decile',
        'hit_object_in_carriageway', 'hit_object_off_carriageway',
        'journey_purpose_of_driver', 'junction_location', 'make', 'model',
        'propulsion_code', 'sex_of_driver', 'skidding_and_overturning',
        'towing_and_articulation', 'vehicle_leaving_carriageway',
        'vehicle_locationrestricted_lane', 'vehicle_manoeuvre',
        'vehicle_reference', 'vehicle_type', 'was_vehicle_left_hand_drive',
        'x1st_point_of_impact'],
       dtype='object')}

Nulls and Outliers

In [18]:
df.isnull().sum().sort_values(ascending=False)/df.shape[0]*100
Out[18]:
driver_imd_decile                              25.118565
age_of_vehicle                                 15.287805
model                                          11.636447
engine_capacity_cc                             11.283251
propulsion_code                                10.899928
make                                            5.846476
lsoa_of_accident_location                       5.674738
2nd_road_class                                  2.894569
2nd_road_number                                 1.291303
pedestrian_crossing-physical_facilities         0.006109
pedestrian_crossing-human_control               0.005276
time                                            0.004582
speed_limit                                     0.001805
vehicle_locationrestricted_lane                 0.001527
did_police_officer_attend_scene_of_accident     0.000139
day_of_week                                     0.000000
1st_road_class                                  0.000000
number_of_vehicles                              0.000000
number_of_casualties                            0.000000
1st_road_number                                 0.000000
longitude                                       0.000000
local_authority_highway                         0.000000
local_authority_district                        0.000000
light_conditions                                0.000000
accident_severity                               0.000000
latitude                                        0.000000
carriageway_hazards                             0.000000
date                                            0.000000
junction_detail                                 0.000000
police_force                                    0.000000
junction_control                                0.000000
x1st_point_of_impact                            0.000000
road_surface_conditions                         0.000000
road_type                                       0.000000
vehicle_type                                    0.000000
vehicle_reference                               0.000000
vehicle_manoeuvre                               0.000000
vehicle_leaving_carriageway                     0.000000
towing_and_articulation                         0.000000
skidding_and_overturning                        0.000000
sex_of_driver                                   0.000000
junction_location                               0.000000
journey_purpose_of_driver                       0.000000
hit_object_off_carriageway                      0.000000
hit_object_in_carriageway                       0.000000
driver_home_area_type                           0.000000
age_band_of_driver                              0.000000
inscotland                                      0.000000
year                                            0.000000
weather_conditions                              0.000000
urban_or_rural_area                             0.000000
was_vehicle_left_hand_drive                     0.000000
special_conditions_at_site                      0.000000
accident_index                                  0.000000
dtype: float64
2nd_road_class
In [19]:
# #2nd_road_class
df['2nd_road_class'].value_counts()/df.shape[0]*100
Out[19]:
Unclassified    65.682512
A               15.892292
C                7.591909
B                6.494558
Motorway         1.301716
A(M)             0.142445
Name: 2nd_road_class, dtype: float64

With 40% of non null being unclassified and 39% of the overall 2nd_road_class column being null, I have decided to drop it in it's entirely.

In [20]:
df = df.drop(['2nd_road_class'], axis=1)
driver_imd_decile
In [21]:
#driver_imd_decile
df['driver_imd_decile'].value_counts()/df.shape[0]*100
Out[21]:
2.0     8.366469
3.0     8.281640
4.0     7.986339
1.0     7.888321
5.0     7.717554
6.0     7.530683
7.0     7.195674
8.0     6.948270
9.0     6.803049
10.0    6.163436
Name: driver_imd_decile, dtype: float64

Since the distribution of categories for 'driver_imd_decile seem very similar, I've decided not to use the mode but "method='ffill'"

In [23]:
df['driver_imd_decile'].fillna(method='ffill', inplace=True)
age_of_vehicle
In [24]:
df['age_of_vehicle'].describe()
Out[24]:
count    610165.000000
mean          7.567473
std           4.751567
min           1.000000
25%           4.000000
50%           7.000000
75%          11.000000
max         105.000000
Name: age_of_vehicle, dtype: float64
In [25]:
df['age_of_vehicle'].median()
Out[25]:
7.0

Changing the nulls of "age of vehicle" to median, then creating it as a category

In [26]:
#fillna by 7 
df['age_of_vehicle'].fillna(7, inplace=True)

#group age_of_vehicle
#1=0-3, 2=3-5, 3=5-8, 4=8-11, 5=
def fixedvehicleage(age):
    if age>=0 and age<=120:
        return age
    else:
        return np.nan

df['age_of_vehicle'] = df['age_of_vehicle'].apply(fixedvehicleage)


df['age_of_vehicle'] = pd.cut(df['age_of_vehicle'], 
                             [0,2,5,8,11,14,17,120], labels=['1', '2', '3','4','5','6','7'])
Model
In [27]:
#model
df['model'].value_counts()/df.shape[0]*100
Out[27]:
MISSING                          0.743183
KA                               0.336675
CLIO DYNAMIQUE 16V               0.279058
FIESTA ZETEC                     0.247681
SPRINTER 313 CDI                 0.236158
206 LX                           0.220331
PUNTO ACTIVE                     0.212279
CLIO EXPRESSION 16V              0.199783
YBR 125                          0.194785
FIESTA FINESSE                   0.177570
MINI COOPER                      0.175904
KA COLLECTION                    0.174793
CORSA CLUB 12V                   0.160910
MICRA S                          0.160493
FIESTA ZETEC CLIMATE             0.158827
CORSA CLUB 16V                   0.158272
PUNTO ACTIVE 8V                  0.141889
MINI ONE                         0.141334
KA STYLE                         0.140640
FIESTA STYLE                     0.140223
FIESTA LX                        0.140084
FOCUS ZETEC TDCI                 0.139113
107 URBAN                        0.137308
CORSA SXI                        0.137308
FOCUS ZETEC 100                  0.136614
ASTRA CLUB 8V                    0.136058
SPRINTER 311 CDI LWB             0.133837
ZAFIRA EXCLUSIV                  0.132032
FOCUS ZETEC                      0.125784
206 LOOK                         0.124535
                                   ...   
325 TDS SE TOURING AUTO          0.000139
COMBO 2300 L2H1 CDTI SPORTIVE    0.000139
6 KUMANO D                       0.000139
V70 T S AUTO                     0.000139
THUNDERBIRD LT                   0.000139
420D GRAN COUPE SPORT AUTO       0.000139
ZR + 120                         0.000139
407 ST HDI                       0.000139
T-SPORTER T30 180 TDI LWB        0.000139
A5 S LINE BLACK EDT TFSI QU      0.000139
ZAFIRA SRI CDTI 8V A             0.000139
C3 RHYTHM HDI 16V                0.000139
C230 K SPORT EDITION             0.000139
C50LA-E                          0.000139
208 ACTIVE S-A                   0.000139
C200 AMG LINE PREMIUM + AUTO     0.000139
ORION LX                         0.000139
3.5 LITRE                        0.000139
CLIO EXPRESSION + 16V QS5        0.000139
TRANSPORTER SD SWB               0.000139
CORSARO 1200 VELOCE              0.000139
CLK 200 KOMP. AVANTGARDE         0.000139
BORA S TDI AUTO                  0.000139
GTV V6 LUSSO 24V                 0.000139
A4 SLINE SPEC ED TDI QUAT        0.000139
A5 S LINE SPECIAL ED TFSI C      0.000139
306 D                            0.000139
ESPACE EXECUTIVE TD              0.000139
PRELUDE 4WS AUTO                 0.000139
SCENIC XMOD D-QUE TT NRG DC      0.000139
Name: model, Length: 28664, dtype: float64
In [28]:
df['model'].describe()
Out[28]:
count      636465
unique      28664
top       MISSING
freq         5353
Name: model, dtype: object

Knowing that there are 28824 unique models for the model column I have decided to use the ffill method on it as well.

In [29]:
df['model'].fillna(method='ffill', inplace=True)

Note: A lot of the values of "model' are labeled as "missing". I do not want to change these because the model could have actually been missing from the car from the accident or it could not be recognizable at the time of the accident.

engine_capacity_cc

In [30]:
#engine_capacity_cc
df['engine_capacity_cc'].describe()
Out[30]:
count    639009.000000
mean       1848.094816
std        1573.057956
min           2.000000
25%        1248.000000
50%        1598.000000
75%        1995.000000
max       91000.000000
Name: engine_capacity_cc, dtype: float64

I am going to handle both outliers and the null values of engine_capacity_cc using the ideals of quantiles and the interquartile range (IQR).

In [32]:
#first I'm going to handle both ends of outliers.
#(determine the min and max cuttoffs for detecting the outlier)
q75, q25 = np.percentile(df['engine_capacity_cc'].dropna(), [75 ,25])
iqr = q75 - q25
 
ecmin = q25 - (iqr*1.5)
ecmax = q75 + (iqr*1.5)

print(ecmax)
print(ecmin)
3115.5
127.5

To explain, what I am going to do is use the ecmax number for the maximum engine_capacity_cc and ecmin for my engine_capacity_cc. Then I'm going to take the mean of those and use it as my fillna.

In [33]:
df = df[df['engine_capacity_cc']<=ecmax]
In [34]:
df = df[df['engine_capacity_cc']>=ecmin]
In [35]:
df['engine_capacity_cc'].hist(bins=20)
plt.style.use('dark_background')

I can accept this distribution and will now check and handle their nulls

In [36]:
#check values of 'engine_capacity_cc'
df['engine_capacity_cc'].describe()
Out[36]:
count    569057.000000
mean       1633.351432
std         473.765085
min         128.000000
25%        1299.000000
50%        1598.000000
75%        1968.000000
max        3110.000000
Name: engine_capacity_cc, dtype: float64
In [37]:
df['engine_capacity_cc'].mean()
Out[37]:
1633.3514322818276

Going to round this mean value

In [38]:
df['engine_capacity_cc'].fillna(1652, inplace=True)

Note: After doing the above null fixes, propulsion_code dropped from having 10% null values to 0. (see below). I will continue on and fix lsoa_of_accident_location then drop the rest of the null values with are all <5%.

In [39]:
df.isnull().sum().sort_values(ascending=False)/df.shape[0]*100
Out[39]:
lsoa_of_accident_location                      5.902045
2nd_road_number                                1.317618
make                                           0.063087
pedestrian_crossing-human_control              0.005448
pedestrian_crossing-physical_facilities        0.005096
time                                           0.003866
vehicle_locationrestricted_lane                0.001406
speed_limit                                    0.001230
did_police_officer_attend_scene_of_accident    0.000176
date                                           0.000000
accident_severity                              0.000000
road_type                                      0.000000
road_surface_conditions                        0.000000
police_force                                   0.000000
1st_road_class                                 0.000000
1st_road_number                                0.000000
number_of_vehicles                             0.000000
number_of_casualties                           0.000000
longitude                                      0.000000
day_of_week                                    0.000000
local_authority_highway                        0.000000
local_authority_district                       0.000000
light_conditions                               0.000000
special_conditions_at_site                     0.000000
junction_detail                                0.000000
carriageway_hazards                            0.000000
junction_control                               0.000000
latitude                                       0.000000
x1st_point_of_impact                           0.000000
was_vehicle_left_hand_drive                    0.000000
urban_or_rural_area                            0.000000
vehicle_type                                   0.000000
vehicle_reference                              0.000000
vehicle_manoeuvre                              0.000000
vehicle_leaving_carriageway                    0.000000
towing_and_articulation                        0.000000
skidding_and_overturning                       0.000000
sex_of_driver                                  0.000000
propulsion_code                                0.000000
model                                          0.000000
junction_location                              0.000000
journey_purpose_of_driver                      0.000000
hit_object_off_carriageway                     0.000000
hit_object_in_carriageway                      0.000000
engine_capacity_cc                             0.000000
driver_imd_decile                              0.000000
driver_home_area_type                          0.000000
age_of_vehicle                                 0.000000
age_band_of_driver                             0.000000
inscotland                                     0.000000
year                                           0.000000
weather_conditions                             0.000000
accident_index                                 0.000000
dtype: float64

lsoa_of_accident_location

In [40]:
# #lsoa_of_accident_location
df['lsoa_of_accident_location'].value_counts()
Out[40]:
E01032739    440
E01004736    412
E01000004    410
E01018648    303
E01004689    261
E01002444    231
E01030458    229
E01011365    213
E01016012    203
E01012851    192
E01024335    188
E01010521    185
E01011107    184
E01007913    178
E01023732    176
E01008440    176
E01013607    176
E01009200    175
E01016952    173
E01022677    173
E01031587    171
E01031583    171
E01032740    170
E01023584    168
E01008397    167
E01007611    166
E01003482    165
E01018337    164
E01005131    163
E01024721    162
            ... 
E01020417      1
E01001776      1
E01001842      1
E01005410      1
E01010078      1
E01032493      1
E01024642      1
E01013208      1
E01029822      1
E01030386      1
E01014887      1
E01028665      1
E01012928      1
E01030956      1
E01028815      1
E01033056      1
E01026820      1
E01014938      1
E01001967      1
E01018642      1
E01018682      1
E01024749      1
E01031616      1
W01000170      1
W01000305      1
E01003999      1
E01012436      1
E01021050      1
E01028837      1
W01001389      1
Name: lsoa_of_accident_location, Length: 33936, dtype: int64
In [41]:
df['lsoa_of_accident_location'].describe()
Out[41]:
count        535471
unique        33936
top       E01032739
freq            440
Name: lsoa_of_accident_location, dtype: object

With 35061 unique variable and a high count amount the top variables I am deciding to do ffill again.

In [42]:
df['lsoa_of_accident_location'].fillna(method='ffill', inplace=True)
In [43]:
#### Check nulls for again
df.isnull().sum().sort_values(ascending=False)/df.shape[0]*100
Out[43]:
2nd_road_number                                1.317618
make                                           0.063087
pedestrian_crossing-human_control              0.005448
pedestrian_crossing-physical_facilities        0.005096
time                                           0.003866
vehicle_locationrestricted_lane                0.001406
speed_limit                                    0.001230
did_police_officer_attend_scene_of_accident    0.000176
carriageway_hazards                            0.000000
longitude                                      0.000000
road_type                                      0.000000
road_surface_conditions                        0.000000
police_force                                   0.000000
1st_road_class                                 0.000000
1st_road_number                                0.000000
number_of_vehicles                             0.000000
number_of_casualties                           0.000000
lsoa_of_accident_location                      0.000000
local_authority_highway                        0.000000
date                                           0.000000
local_authority_district                       0.000000
light_conditions                               0.000000
special_conditions_at_site                     0.000000
junction_detail                                0.000000
accident_severity                              0.000000
junction_control                               0.000000
day_of_week                                    0.000000
latitude                                       0.000000
x1st_point_of_impact                           0.000000
was_vehicle_left_hand_drive                    0.000000
urban_or_rural_area                            0.000000
vehicle_type                                   0.000000
vehicle_reference                              0.000000
vehicle_manoeuvre                              0.000000
vehicle_leaving_carriageway                    0.000000
towing_and_articulation                        0.000000
skidding_and_overturning                       0.000000
sex_of_driver                                  0.000000
propulsion_code                                0.000000
model                                          0.000000
junction_location                              0.000000
journey_purpose_of_driver                      0.000000
hit_object_off_carriageway                     0.000000
hit_object_in_carriageway                      0.000000
engine_capacity_cc                             0.000000
driver_imd_decile                              0.000000
driver_home_area_type                          0.000000
age_of_vehicle                                 0.000000
age_band_of_driver                             0.000000
inscotland                                     0.000000
year                                           0.000000
weather_conditions                             0.000000
accident_index                                 0.000000
dtype: float64

Dropping the remaining nulls that are <1%.

In [44]:
#drop the remaining nulls that are <1%
df.dropna(inplace=True)

#last check
df.isnull().sum().sort_values(ascending=False)/df.shape[0]*100
Out[44]:
x1st_point_of_impact                           0.0
speed_limit                                    0.0
road_type                                      0.0
road_surface_conditions                        0.0
police_force                                   0.0
pedestrian_crossing-physical_facilities        0.0
pedestrian_crossing-human_control              0.0
number_of_vehicles                             0.0
number_of_casualties                           0.0
lsoa_of_accident_location                      0.0
longitude                                      0.0
local_authority_highway                        0.0
local_authority_district                       0.0
light_conditions                               0.0
latitude                                       0.0
junction_detail                                0.0
junction_control                               0.0
did_police_officer_attend_scene_of_accident    0.0
day_of_week                                    0.0
date                                           0.0
carriageway_hazards                            0.0
accident_severity                              0.0
2nd_road_number                                0.0
1st_road_number                                0.0
1st_road_class                                 0.0
special_conditions_at_site                     0.0
time                                           0.0
was_vehicle_left_hand_drive                    0.0
urban_or_rural_area                            0.0
vehicle_type                                   0.0
vehicle_reference                              0.0
vehicle_manoeuvre                              0.0
vehicle_locationrestricted_lane                0.0
vehicle_leaving_carriageway                    0.0
towing_and_articulation                        0.0
skidding_and_overturning                       0.0
sex_of_driver                                  0.0
propulsion_code                                0.0
model                                          0.0
make                                           0.0
junction_location                              0.0
journey_purpose_of_driver                      0.0
hit_object_off_carriageway                     0.0
hit_object_in_carriageway                      0.0
engine_capacity_cc                             0.0
driver_imd_decile                              0.0
driver_home_area_type                          0.0
age_of_vehicle                                 0.0
age_band_of_driver                             0.0
inscotland                                     0.0
year                                           0.0
weather_conditions                             0.0
accident_index                                 0.0
dtype: float64
In [45]:
df.shape
Out[45]:
(561135, 53)
In [46]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 561135 entries, 0 to 720279
Data columns (total 53 columns):
accident_index                                 561135 non-null object
1st_road_class                                 561135 non-null object
1st_road_number                                561135 non-null object
2nd_road_number                                561135 non-null object
accident_severity                              561135 non-null object
carriageway_hazards                            561135 non-null object
date                                           561135 non-null datetime64[ns]
day_of_week                                    561135 non-null object
did_police_officer_attend_scene_of_accident    561135 non-null object
junction_control                               561135 non-null object
junction_detail                                561135 non-null object
latitude                                       561135 non-null float64
light_conditions                               561135 non-null object
local_authority_district                       561135 non-null object
local_authority_highway                        561135 non-null object
longitude                                      561135 non-null float64
lsoa_of_accident_location                      561135 non-null object
number_of_casualties                           561135 non-null int64
number_of_vehicles                             561135 non-null int64
pedestrian_crossing-human_control              561135 non-null object
pedestrian_crossing-physical_facilities        561135 non-null object
police_force                                   561135 non-null object
road_surface_conditions                        561135 non-null object
road_type                                      561135 non-null object
special_conditions_at_site                     561135 non-null object
speed_limit                                    561135 non-null float64
time                                           561135 non-null object
urban_or_rural_area                            561135 non-null object
weather_conditions                             561135 non-null object
year                                           561135 non-null int64
inscotland                                     561135 non-null object
age_band_of_driver                             561135 non-null object
age_of_vehicle                                 561135 non-null category
driver_home_area_type                          561135 non-null object
driver_imd_decile                              561135 non-null float64
engine_capacity_cc                             561135 non-null float64
hit_object_in_carriageway                      561135 non-null object
hit_object_off_carriageway                     561135 non-null object
journey_purpose_of_driver                      561135 non-null object
junction_location                              561135 non-null object
make                                           561135 non-null object
model                                          561135 non-null object
propulsion_code                                561135 non-null object
sex_of_driver                                  561135 non-null object
skidding_and_overturning                       561135 non-null object
towing_and_articulation                        561135 non-null object
vehicle_leaving_carriageway                    561135 non-null object
vehicle_locationrestricted_lane                561135 non-null object
vehicle_manoeuvre                              561135 non-null object
vehicle_reference                              561135 non-null object
vehicle_type                                   561135 non-null object
was_vehicle_left_hand_drive                    561135 non-null object
x1st_point_of_impact                           561135 non-null object
dtypes: category(1), datetime64[ns](1), float64(5), int64(3), object(43)
memory usage: 227.4+ MB

More outliers, categorizing, and other cleanup

In [47]:
#detecting outliers of numerical columns (all floats/ints excluding lat/long and year)

df_num = df[['engine_capacity_cc','number_of_casualties','number_of_vehicles','speed_limit']]
In [48]:
df_num.hist( bins=25, grid=False, figsize=(12,8))
plt.style.use('dark_background')

Column 'speed_limit' seems ok and was previously altered 'engine_capacity_cc'. However, 'number_of_casualties', and 'number_of_vehicles',will be evaluated.

In [49]:
# #number_of_casualties
df['number_of_casualties'].value_counts()
Out[49]:
1     391938
2     113736
3      35451
4      12511
5       4621
6       1739
7        599
8        243
9        146
10        52
11        29
12        28
13        15
16        10
14         4
15         3
17         3
24         2
21         2
19         1
22         1
43         1
Name: number_of_casualties, dtype: int64
In [50]:
#create casualities grouping

def casualities(num_cas):
    if num_cas >=1 and num_cas <2:
        return "1"
    elif num_cas >=2 and num_cas <3:
        return "2"
    elif num_cas >=3 and num_cas <4:
        return "3"
    elif num_cas >= 4 and num_cas <5:
        return "4"
    elif num_cas >= 5:
        return "5+"
  
    
In [51]:
#apply function   
df['number_of_casualties']= df['number_of_casualties'].apply(casualities)
In [52]:
#number_of_casualties
df['number_of_casualties'].value_counts()
Out[52]:
1     391938
2     113736
3      35451
4      12511
5+      7499
Name: number_of_casualties, dtype: int64
In [53]:
df['propulsion_code'].value_counts()/df.shape[0]*100
Out[53]:
Petrol                 60.540155
Heavy oil              38.544913
Hybrid electric         0.782699
Gas/Bi-fuel             0.094808
Petrol/Gas (LPG)        0.021207
Electric diesel         0.013188
Gas                     0.002317
New fuel technology     0.000356
Gas Diesel              0.000178
Fuel cells              0.000178
Name: propulsion_code, dtype: float64
In [54]:
#Clean the values for Propulsion Code. 
df['propulsion_code'] = df['propulsion_code'].replace(to_replace="Gas", value="Petrol")
df['propulsion_code'] = df['propulsion_code'].replace(to_replace="Gas/Bi-fuel", value="Bio-fuel")
df['propulsion_code'] = df['propulsion_code'].replace(to_replace="Petrol/Gas (LPG)", value="LPG Petrol")
df['propulsion_code'] = df['propulsion_code'].replace(to_replace="Gas Diesel", value="Diesel")
In [55]:
df['propulsion_code'].value_counts()/df.shape[0]*100
Out[55]:
Petrol                 60.542472
Heavy oil              38.544913
Hybrid electric         0.782699
Bio-fuel                0.094808
LPG Petrol              0.021207
Electric diesel         0.013188
New fuel technology     0.000356
Diesel                  0.000178
Fuel cells              0.000178
Name: propulsion_code, dtype: float64

Feature Manipulation Creation and Engineering

In [56]:
# #unique values
df.nunique().sort_values(ascending=False)
Out[56]:
accident_index                                 412838
longitude                                      356283
latitude                                       346962
lsoa_of_accident_location                       33895
model                                           25688
2nd_road_number                                  5781
1st_road_number                                  5088
date                                             2557
time                                             1439
engine_capacity_cc                               1023
local_authority_district                          380
make                                              226
local_authority_highway                           207
police_force                                       51
vehicle_manoeuvre                                  18
vehicle_type                                       16
number_of_vehicles                                 15
vehicle_reference                                  15
hit_object_in_carriageway                          12
hit_object_off_carriageway                         12
age_band_of_driver                                 11
driver_imd_decile                                  10
vehicle_locationrestricted_lane                    10
weather_conditions                                  9
junction_location                                   9
vehicle_leaving_carriageway                         9
junction_detail                                     9
propulsion_code                                     9
special_conditions_at_site                          8
day_of_week                                         7
journey_purpose_of_driver                           7
year                                                7
age_of_vehicle                                      7
speed_limit                                         7
carriageway_hazards                                 6
pedestrian_crossing-physical_facilities             6
towing_and_articulation                             6
1st_road_class                                      6
skidding_and_overturning                            6
junction_control                                    5
x1st_point_of_impact                                5
light_conditions                                    5
number_of_casualties                                5
road_surface_conditions                             5
road_type                                           5
pedestrian_crossing-human_control                   3
did_police_officer_attend_scene_of_accident         3
accident_severity                                   3
driver_home_area_type                               3
sex_of_driver                                       3
was_vehicle_left_hand_drive                         2
urban_or_rural_area                                 2
inscotland                                          2
dtype: int64
In [57]:
df['date'] = pd.to_datetime(df['date'])
In [58]:
df['month'] = df ['date'].apply(lambda time: time.month)
In [59]:
#creating a weekend feature that includes Friday-Sunday
df['weekend']= np.where(df['day_of_week'].isin(['Friday', 'Saturday', 'Sunday']), 1, 0)
In [225]:
#create time of day feature with Morning Rush, Day, Noon Rush, Afternoon, After Work Rush, Night

#time of day dictionary
timeofdaygroups = {1: "Morning Rush (6-10)",
                   2: "Day (10-12)",
                   3: "Lunch Rush (12-14)",
                   4: "Afternoon (14-16)",
                   5: "After Work Rush (16-18)",
                   6: "Evening (18-22)",
                   7: "Night (22-6)"}
In [61]:
#pull time data and create hour column
df['hour'] = df['time'].str[0:2]
 
#convert to numeric    
df['hour'] =  pd.to_numeric(df['hour'])

#convert to integer
df['hour'] = df['hour'].astype('int')
In [228]:
#create time_of_day grouping

def daygroup(hour):
    if hour >= 6 and hour < 10:
        return "1"
    elif hour >= 10 and hour < 12:
        return "2"
    elif hour >= 12 and hour < 14:
        return "3"
    elif hour >= 14 and hour < 16:
        return "4"
    elif hour >= 16 and hour < 18:
        return "5"
    elif hour >= 18 and hour < 22:
        return "6"
    else:
        return "7"
    
In [229]:
#apply function   
#time of day function
df['time_of_day']= df['hour'].apply(daygroup)   
In [64]:
df[['weekend','day_of_week','time', 'time_of_day']].tail(10)
Out[64]:
weekend day_of_week time time_of_day
720270 0 Wednesday 08:45 1
720271 0 Wednesday 08:45 1
720272 0 Tuesday 18:12 6
720273 1 Sunday 11:00 2
720274 1 Sunday 11:00 2
720275 0 Monday 16:32 5
720276 0 Monday 16:32 5
720277 1 Friday 06:45 1
720278 0 Tuesday 16:45 5
720279 0 Tuesday 16:45 5
In [65]:
#vehicle_type
df['vehicle_type'].value_counts()/df.shape[0]*100
Out[65]:
Car                                      86.052020
Van / Goods 3.5 tonnes mgw or under       5.481212
Motorcycle over 500cc                     3.999929
Taxi/Private hire car                     2.932628
Motorcycle over 125cc and up to 500cc     0.881428
Motorcycle 125cc and under                0.178567
Minibus (8 - 16 passenger seats)          0.140251
Other vehicle                             0.131519
Goods over 3.5t. and under 7.5t           0.083937
Motorcycle 50cc and under                 0.043840
Bus or coach (17 or more pass seats)      0.024771
Goods vehicle - unknown weight            0.015326
Goods 7.5 tonnes mgw and over             0.013188
Motorcycle - unknown cc                   0.013009
Agricultural vehicle                      0.007128
Electric motorcycle                       0.001247
Name: vehicle_type, dtype: float64

I want to condense the vehicle type variables.

In [123]:
#motorcycles
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Motorcycle over 500cc", 
                                                        value="Motorcycle")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace=
                                                        "Motorcycle over 125cc and up to 500cc",
                                                        value="Motorcycle")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Motorcycle 125cc and under", 
                                                value="Motorcycle")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Motorcycle 50cc and under", 
                                                        value="Motorcycle")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Electric motorcycle", 
                                                        value="Motorcycle")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Motorcycle - unknown cc", 
                                                        value="Motorcycle")


#Goods_vehicle
df['vehicle_type'] = df['vehicle_type'].replace(to_replace=
                                                        "Van / Goods 3.5 tonnes mgw or under", 
                                                        value="Goods Vehicle")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Goods over 3.5t. and under 7.5t", 
                                                        value="Goods Vehicle")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Goods vehicle - unknown weight", 
                                                        value="Goods Vehicle")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Goods 7.5 tonnes mgw and over", 
                                                        value="Goods Vehicle")

#car
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Taxi/Private hire car", 
                                                        value="Car")


#bus
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Minibus (8 - 16 passenger seats)", 
                                                        value="Bus")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace=
                                                        "Bus or coach (17 or more pass seats)", 
                                                        value="Bus")

#other vehicle
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Agricultural vehicle", 
                                                        value="Other Vehicle")
df['vehicle_type'] = df['vehicle_type'].replace(to_replace="Other vehicle", 
                                                        value="Other Vehicle")
In [124]:
#vehicle_type
df['vehicle_type'].value_counts()/df.shape[0]*100
Out[124]:
Car              88.984647
Goods Vehicle     5.593663
Motorcycle        5.118020
Bus               0.165023
Other Vehicle     0.138648
Name: vehicle_type, dtype: float64

Create more condense groups for age band of driver in order to deal with some potential outliers.

In [68]:
#age_band_of_driver 
df['age_band_of_driver'].value_counts()/df.shape[0]*100
Out[68]:
26 - 35    22.598662
36 - 45    20.728345
46 - 55    17.551213
21 - 25    12.208827
56 - 65    10.373796
16 - 20     7.145696
66 - 75     5.624850
Over 75     3.757028
11 - 15     0.011049
6 - 10      0.000356
0 - 5       0.000178
Name: age_band_of_driver, dtype: float64
In [69]:
#I did this before hand because as "Over 75", it wouldnt convert in the codes below
df['age_band_of_driver']=df['age_band_of_driver'].replace("Over 75","75-100")
In [70]:
age1 = ["0 - 5", "6 - 10", "11 - 15"]
age2 = ["16 - 20","21 - 25"]
age3 = ["26 - 35","36 - 45"]
age4 = ["46 - 55", "56 - 65"]
age5 = ["66 - 75", "75-100"]
In [71]:
#over 75 wouldnt work in the string so I did it separately
for (row, col) in df.iterrows():

    if str.lower(col.age_band_of_driver) in age1:
        df['age_band_of_driver'].replace(to_replace=col.age_band_of_driver, 
                                         value='Under 16', inplace=True)

    if str.lower(col.age_band_of_driver) in age2:
        df['age_band_of_driver'].replace(to_replace=col.age_band_of_driver, 
                                         value='16-25', inplace=True)
    
    if str.lower(col.age_band_of_driver) in age3:
        df['age_band_of_driver'].replace(to_replace=col.age_band_of_driver, 
                                         value='26-45', inplace=True)
    if str.lower(col.age_band_of_driver) in age4:
        df['age_band_of_driver'].replace(to_replace=col.age_band_of_driver, 
                                         value='46-65', inplace=True)
    if str.lower(col.age_band_of_driver) in age5:
        df['age_band_of_driver'].replace(to_replace=col.age_band_of_driver, 
                                         value='Over 65', inplace=True)
In [72]:
#age_band_of_driver
print("Distinct responses for age_band_of_driver:\n", set(df['age_band_of_driver']))
Distinct responses for age_band_of_driver:
 {'Over 65', 'Under 16', '46-65', '26-45', '16-25'}
In [73]:
# number_of_vehicles
df['number_of_vehicles'].value_counts()/df.shape[0]*100
Out[73]:
2     72.944835
3     11.924403
1     11.570121
4      2.706122
5      0.582747
6      0.167874
7      0.059166
8      0.021385
11     0.005881
9      0.005881
10     0.004277
14     0.002317
13     0.002139
16     0.001782
12     0.001069
Name: number_of_vehicles, dtype: float64
In [74]:
#group number_of_vehicles

def vehicles(num_veh):
    if num_veh >=1 and num_veh <2:
        return "1"
    elif num_veh >=2 and num_veh <3:
        return "2"
    elif num_veh >=3 and num_veh <4:
        return "3"
    elif num_veh >= 4: 
        return "4+"
  
#apply function   
df['number_of_vehicles']= df['number_of_vehicles'].apply(vehicles)
In [75]:
# number_of_vehicles
df['number_of_vehicles'].value_counts()/df.shape[0]*100
Out[75]:
2     72.944835
3     11.924403
1     11.570121
4+     3.560640
Name: number_of_vehicles, dtype: float64
In [76]:
df['number_of_vehicles'].dtypes
Out[76]:
dtype('O')
In [77]:
df['number_of_vehicles']=df['number_of_vehicles'].astype('object')
In [78]:
#creating seasons column for ML

#creating season column

def getSeason(month):
    if (month == 12 or month == 1 or month == 2):
       return "winter"
    elif(month == 3 or month == 4 or month == 5):
       return "spring"
    elif(month == 6 or month== 7 or month == 8):
       return "summer"
    else:
       return "fall"

df['season'] = df['month'].apply(getSeason)
In [79]:
# number_of_vehicles
df['season'].value_counts()/df.shape[0]*100
Out[79]:
fall      27.065858
summer    25.502241
spring    24.624912
winter    22.806989
Name: season, dtype: float64
In [80]:
#go back to engine capacity CC and crete groups
df.engine_capacity_cc.hist()
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b326feef60>
In [81]:
def enginecap(eng_cc):
    if eng_cc <=1500:
        return "small engine cc"
    if eng_cc >1500 and eng_cc <=2000:
        return "medium engine cc"
    if eng_cc >2000:
        return "large engine cc"


df['engine_capacity_cc_size'] = df['engine_capacity_cc'].apply(enginecap)
In [82]:
df.engine_capacity_cc_size.value_counts()
Out[82]:
medium engine cc    259881
small engine cc     231031
large engine cc      70223
Name: engine_capacity_cc_size, dtype: int64
In [83]:
#Put above pickle in next full run
#create new column for Machine Learning and Visualization with Not Serious and Serious
df['accident_seriousness'] = df['accident_severity']
df['accident_seriousness'] = df['accident_seriousness'].replace(to_replace="Slight", 
                                                                value="Not Serious")
df['accident_seriousness'] = df['accident_seriousness'].replace(to_replace="Serious",
                                                                value="Serious")
df['accident_seriousness'] = df['accident_seriousness'].replace(to_replace="Fatal", 
                                                                value="Serious")
df.shape
Out[83]:
(561135, 60)
In [84]:
df.accident_seriousness.value_counts()
Out[84]:
Not Serious    492804
Serious         68331
Name: accident_seriousness, dtype: int64
In [85]:
#pickling everything to speed up restarting
df.to_pickle("df.pkl")
In [16]:
#import pickled file
df = pd.read_pickle("df.pkl")
df.head()
Out[16]:
accident_index 1st_road_class 1st_road_number 2nd_road_number accident_severity carriageway_hazards date day_of_week did_police_officer_attend_scene_of_accident junction_control ... vehicle_type was_vehicle_left_hand_drive x1st_point_of_impact month weekend hour time_of_day season engine_capacity_cc_size accident_seriousness
0 201001BS70003 B 302 0 Slight None 2010-01-11 Monday 1 Give way or uncontrolled ... Goods Vehicle No Front 1 0 7 1 winter small engine cc Not Serious
1 201001BS70004 A 402 4204 Slight None 2010-01-11 Monday 1 Auto traffic signal ... Car No Front 1 0 18 6 winter medium engine cc Not Serious
3 201001BS70007 Unclassified 0 0 Slight None 2010-01-02 Saturday 1 Give way or uncontrolled ... Car No Nearside 1 1 21 6 winter medium engine cc Not Serious
4 201001BS70007 Unclassified 0 0 Slight None 2010-01-02 Saturday 1 Give way or uncontrolled ... Car No Front 1 1 21 6 winter small engine cc Not Serious
5 201001BS70008 A 3217 3220 Slight None 2010-01-04 Monday 1 Auto traffic signal ... Car No Nearside 1 0 20 6 winter medium engine cc Not Serious

5 rows Ă— 60 columns

General Visualizations

In [278]:
accidentsperyear = df.groupby(['year'])['accident_index'].count()

# prepare plot
plt.style.use('dark_background')
plt.figure(figsize=(10,5))
colors = sns.color_palette("plasma", n_colors=7)
sns.barplot(accidentsperyear.index,accidentsperyear.values, palette=colors)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.title("Accidents Per Year",fontsize=20,fontweight="bold")
plt.xlabel("\nYear", fontsize=15, fontweight="bold")
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.savefig('accidentsperyear.png')
plt.tight_layout()
In [277]:
accidentspermonth = df.groupby(['month'])['accident_index'].count()

# prepare plot
plt.style.use('dark_background')
plt.figure(figsize=(20,10))
colors = sns.color_palette("plasma_r", n_colors=12)
mt=sns.barplot(accidentspermonth.index,accidentspermonth.values, palette=colors)
sns.despine(top=True, right=True, left=True, bottom=True)
#ax is the axes instance
group_labels = ['Jan', 'Feb','Mar','Apr','May','June','July','Aug','Sept','Oct','Nov','Dec' ]

mt.set_xticklabels(group_labels)
plt.title("Accidents Per Month",fontsize=20,fontweight="bold")
plt.xticks(fontsize=18)
plt.yticks(fontsize=12)
plt.xlabel("\nMonth", fontsize=15, fontweight="bold")
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.savefig('accidentspermonth.png')
plt.tight_layout()
In [276]:
weekdays = ['Monday', 'Tuesday','Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday']
accweekday = df.groupby(['year', 'day_of_week']).size()
accweekday = accweekday.rename_axis(['year', 'day_of_week'])\
                               .unstack('day_of_week')\
                               .reindex(columns=weekdays)
plt.figure(figsize=(15,10))
plt.style.use('dark_background')
sns.heatmap(accweekday, cmap='plasma_r')
plt.title('\nAccidents by Weekday per Year\n', fontsize=14, fontweight='bold')
plt.xticks(fontsize=15)
plt.yticks(fontsize=12)
plt.xlabel('')
plt.ylabel('')
plt.savefig('accidentsbyweekdayperyear.png')
plt.show()

Fridays are the day of the week where the most accidents occur.

In [273]:
accidentsperseason = df.groupby(['season'])['accident_index'].count()
seaord=['spring', 'summer', 'fall','winter']
# prepare plot
plt.style.use('dark_background')
plt.figure(figsize=(15,10))

sns.barplot(accidentsperseason.index,accidentsperseason.values, order=seaord, 
            saturation=1, palette='magma_r')
sns.despine(top=True, right=True, left=True, bottom=True)
plt.title("Accidents Per Season",fontsize=20,fontweight="bold")
plt.xticks(fontsize=15)
plt.yticks(fontsize=12)
plt.xlabel("\nSeason", fontsize=15, fontweight="bold")
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.tight_layout()
plt.savefig('accidentsperseason.png')
In [17]:
#"Morning Rush (6-10)", "Day (10-12)", "Lunch Rush (12-14)","Afternoon (14-16)",
#"After Work Rush (16-18)", "Evening (18-22)", "Night (22-6)"

timeofdaygroups = {'1': "Morning Rush",
                   '2': "Day",
                   '3': "Lunch Rush",
                   '4': "Afternoon",
                   '5': "After Work Rush",
                   '6': "Evening",
                   '7': "Night"}
df['time_of_day']=df['time_of_day'].map(timeofdaygroups)
In [267]:
accidentspertod = df.groupby(['time_of_day'])['accident_index'].count()

# prepare plot
plt.style.use('dark_background')
plt.figure(figsize=(15,10))
tod=["Morning Rush", "Day", "Lunch Rush", "Afternoon",
     "After Work Rush", "Evening", "Night"]
sns.barplot(accidentspertod.index,accidentspertod.values, order=tod, palette='rainbow')
sns.despine(top=True, right=True, left=True, bottom=True)
plt.title("Accidents Per Time of Day",fontsize=20,fontweight="bold")
plt.xticks(fontsize=15)
plt.yticks(fontsize=12)

plt.xlabel("", fontsize=15, fontweight="bold")
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.tight_layout()
plt.savefig('accidentspertod.png')

Accident Forecasting with Tableau

In [18]:
%%HTML
<div class='tableauPlaceholder' id='viz1572176706313' style='position: relative'><noscript><a href='https:&#47;&#47;github.com&#47;GenTaylor&#47;Traffic-Accident-Analysis'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ac&#47;AccidentForecasting&#47;AccidentForecasting&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='AccidentForecasting&#47;AccidentForecasting' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ac&#47;AccidentForecasting&#47;AccidentForecasting&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1572176706313');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

According to the forcasting above, traffic accidents will be slightly lower than years before but following similar trends throughout the months.

Correlations

For correlation I used both Pearson and Spearman just in case there would be discrepancies. The order may have slightly varied but the "highest" correlated remained the same.

In [7]:
#correlation by accident severity pearson
corrdf=df.apply(LabelEncoder().fit_transform)
sc = StandardScaler()
corrdf = sc.fit_transform(corrdf)
corrdf=pd.DataFrame(data=corrdf,columns=df.columns)
corr=corrdf.corr()['accident_seriousness']
corr[np.argsort(corr,axis=0)[::-1]]
Out[7]:
accident_seriousness                           1.000000
vehicle_type                                   0.133941
x1st_point_of_impact                           0.076099
speed_limit                                    0.065554
skidding_and_overturning                       0.059322
vehicle_leaving_carriageway                    0.058977
sex_of_driver                                  0.054505
lsoa_of_accident_location                      0.048548
number_of_casualties                           0.043294
age_band_of_driver                             0.039851
junction_control                               0.037966
hit_object_off_carriageway                     0.036040
time_of_day                                    0.026101
model                                          0.022808
accident_index                                 0.022391
junction_location                              0.019548
road_type                                      0.019443
engine_capacity_cc_size                        0.019305
driver_imd_decile                              0.017195
weekend                                        0.017001
date                                           0.016821
propulsion_code                                0.016685
junction_detail                                0.016134
year                                           0.016002
age_of_vehicle                                 0.014322
inscotland                                     0.012642
vehicle_locationrestricted_lane                0.011370
month                                          0.006929
latitude                                       0.006853
carriageway_hazards                            0.004003
1st_road_number                                0.003862
towing_and_articulation                        0.003658
time                                           0.003062
hour                                           0.002450
local_authority_district                       0.002129
pedestrian_crossing-human_control             -0.001036
day_of_week                                   -0.001781
make                                          -0.002377
special_conditions_at_site                    -0.002652
was_vehicle_left_hand_drive                   -0.003057
1st_road_class                                -0.004179
journey_purpose_of_driver                     -0.005085
local_authority_highway                       -0.005379
season                                        -0.007647
2nd_road_number                               -0.008978
police_force                                  -0.009640
hit_object_in_carriageway                     -0.010963
pedestrian_crossing-physical_facilities       -0.012054
light_conditions                              -0.012499
road_surface_conditions                       -0.015441
longitude                                     -0.024353
weather_conditions                            -0.029648
vehicle_reference                             -0.037848
driver_home_area_type                         -0.041281
engine_capacity_cc                            -0.047446
vehicle_manoeuvre                             -0.048098
urban_or_rural_area                           -0.065074
number_of_vehicles                            -0.066120
did_police_officer_attend_scene_of_accident   -0.086731
accident_severity                             -0.973745
Name: accident_seriousness, dtype: float64
In [8]:
corr_spear=corrdf.corr(method='spearman')['accident_seriousness']
corr_spear[np.argsort(corr_spear,axis=0)[::-1]]
Out[8]:
accident_seriousness                           1.000000
vehicle_type                                   0.114374
vehicle_leaving_carriageway                    0.071079
x1st_point_of_impact                           0.067697
speed_limit                                    0.062779
skidding_and_overturning                       0.059746
sex_of_driver                                  0.054629
lsoa_of_accident_location                      0.048538
junction_control                               0.041690
age_band_of_driver                             0.037758
hit_object_off_carriageway                     0.034738
road_type                                      0.028215
time_of_day                                    0.026387
junction_location                              0.024224
junction_detail                                0.024051
accident_index                                 0.022667
number_of_casualties                           0.022535
engine_capacity_cc_size                        0.022157
model                                          0.021979
date                                           0.017136
weekend                                        0.017001
driver_imd_decile                              0.016911
propulsion_code                                0.016522
year                                           0.016282
inscotland                                     0.012642
vehicle_locationrestricted_lane                0.010124
age_of_vehicle                                 0.009042
time                                           0.007971
hour                                           0.007432
month                                          0.006572
1st_road_number                                0.006437
latitude                                       0.006432
towing_and_articulation                        0.004320
carriageway_hazards                            0.003980
local_authority_district                       0.001801
make                                           0.000794
special_conditions_at_site                     0.000097
journey_purpose_of_driver                     -0.000857
day_of_week                                   -0.001860
pedestrian_crossing-human_control             -0.001870
was_vehicle_left_hand_drive                   -0.003057
1st_road_class                                -0.003177
local_authority_highway                       -0.005600
season                                        -0.007485
police_force                                  -0.011075
road_surface_conditions                       -0.015850
pedestrian_crossing-physical_facilities       -0.016310
hit_object_in_carriageway                     -0.017504
light_conditions                              -0.019421
longitude                                     -0.024395
2nd_road_number                               -0.027880
weather_conditions                            -0.028520
engine_capacity_cc                            -0.036678
driver_home_area_type                         -0.041308
vehicle_manoeuvre                             -0.046600
vehicle_reference                             -0.049230
urban_or_rural_area                           -0.065074
number_of_vehicles                            -0.078513
did_police_officer_attend_scene_of_accident   -0.086918
accident_severity                             -0.999548
Name: accident_seriousness, dtype: float64

Looking at this I wanted to visualize some of the higher pos/negative correlations against accident severity.

Chi-Squared Test

Before these visualizations were done, I wanted to be sure that the visualizations were of some importance to accident_seriousness. For this, the chi-squared test was used.

In [12]:
"""chisquare algorithm from 
http://www.insightsbot.com/blog/2AeuRL/chi-square-feature-selection-in-python """

    
class ChiSquare:
    def __init__(self, dataframe):
        self.df = dataframe
        self.p = None #P-Value
        self.chi2 = None #Chi Test Statistic
        self.dof = None
        
        self.dfObserved = None
        self.dfExpected = None
        
    def _print_chisquare_result(self, colX, alpha):
        result = ""
        if self.p<alpha:
            result="The column {0} is IMPORTANT for Prediction".format(colX)
        else:
            result="The column {0} is NOT an important predictor. (Discard {0} from model)".format(colX)

        print(result)
        
    def TestIndependence(self,colX,colY, alpha=0.05):
        X = self.df[colX].astype(str)
        Y = self.df[colY].astype(str)
        
        self.dfObserved = pd.crosstab(Y,X) 
        chi2, p, dof, expected = stats.chi2_contingency(self.dfObserved.values)
        self.p = p
        self.chi2 = chi2
        self.dof = dof 
        
        self.dfExpected = pd.DataFrame(expected, columns=self.dfObserved.columns, 
                                       index = self.dfObserved.index)
        
        self._print_chisquare_result(colX,alpha)

#Initialize ChiSquare Class
cT = ChiSquare(df)

#Feature Selection
testColumns = ['accident_index', '1st_road_class', '1st_road_number','2nd_road_number', 
               'carriageway_hazards', 'date', 'day_of_week', 
               'did_police_officer_attend_scene_of_accident','junction_control', 
               'junction_detail', 'latitude', 'light_conditions', 'local_authority_district',
               'local_authority_highway', 'longitude','lsoa_of_accident_location', 
               'number_of_casualties', 'number_of_vehicles', 'pedestrian_crossing-human_control',
               'pedestrian_crossing-physical_facilities', 'police_force','road_surface_conditions', 
               'road_type', 'special_conditions_at_site', 'speed_limit', 'time', 
               'urban_or_rural_area', 'weather_conditions', 'year', 'inscotland', 
               'age_band_of_driver', 'age_of_vehicle', 'driver_home_area_type', 
               'driver_imd_decile', 'engine_capacity_cc','hit_object_in_carriageway', 
               'hit_object_off_carriageway', 'journey_purpose_of_driver', 'junction_location', 
               'make', 'model','propulsion_code', 'sex_of_driver', 'skidding_and_overturning',
               'towing_and_articulation', 'vehicle_leaving_carriageway',
               'vehicle_locationrestricted_lane', 'vehicle_manoeuvre','vehicle_reference',
               'vehicle_type', 'was_vehicle_left_hand_drive', 'x1st_point_of_impact', 'month',
               'weekend', 'hour', 'time_of_day','season', 'engine_capacity_cc_size']
for var in testColumns:
    cT.TestIndependence(colX=var,colY="accident_seriousness" )  
The column accident_index is IMPORTANT for Prediction
The column 1st_road_class is IMPORTANT for Prediction
The column 1st_road_number is IMPORTANT for Prediction
The column 2nd_road_number is IMPORTANT for Prediction
The column carriageway_hazards is IMPORTANT for Prediction
The column date is IMPORTANT for Prediction
The column day_of_week is IMPORTANT for Prediction
The column did_police_officer_attend_scene_of_accident is IMPORTANT for Prediction
The column junction_control is IMPORTANT for Prediction
The column junction_detail is IMPORTANT for Prediction
The column latitude is IMPORTANT for Prediction
The column light_conditions is IMPORTANT for Prediction
The column local_authority_district is IMPORTANT for Prediction
The column local_authority_highway is IMPORTANT for Prediction
The column longitude is IMPORTANT for Prediction
The column lsoa_of_accident_location is IMPORTANT for Prediction
The column number_of_casualties is IMPORTANT for Prediction
The column number_of_vehicles is IMPORTANT for Prediction
The column pedestrian_crossing-human_control is IMPORTANT for Prediction
The column pedestrian_crossing-physical_facilities is IMPORTANT for Prediction
The column police_force is IMPORTANT for Prediction
The column road_surface_conditions is IMPORTANT for Prediction
The column road_type is IMPORTANT for Prediction
The column special_conditions_at_site is IMPORTANT for Prediction
The column speed_limit is IMPORTANT for Prediction
The column time is IMPORTANT for Prediction
The column urban_or_rural_area is IMPORTANT for Prediction
The column weather_conditions is IMPORTANT for Prediction
The column year is IMPORTANT for Prediction
The column inscotland is IMPORTANT for Prediction
The column age_band_of_driver is IMPORTANT for Prediction
The column age_of_vehicle is IMPORTANT for Prediction
The column driver_home_area_type is IMPORTANT for Prediction
The column driver_imd_decile is IMPORTANT for Prediction
The column engine_capacity_cc is IMPORTANT for Prediction
The column hit_object_in_carriageway is IMPORTANT for Prediction
The column hit_object_off_carriageway is IMPORTANT for Prediction
The column journey_purpose_of_driver is IMPORTANT for Prediction
The column junction_location is IMPORTANT for Prediction
The column make is IMPORTANT for Prediction
The column model is IMPORTANT for Prediction
The column propulsion_code is IMPORTANT for Prediction
The column sex_of_driver is IMPORTANT for Prediction
The column skidding_and_overturning is IMPORTANT for Prediction
The column towing_and_articulation is IMPORTANT for Prediction
The column vehicle_leaving_carriageway is IMPORTANT for Prediction
The column vehicle_locationrestricted_lane is IMPORTANT for Prediction
The column vehicle_manoeuvre is IMPORTANT for Prediction
The column vehicle_reference is IMPORTANT for Prediction
The column vehicle_type is IMPORTANT for Prediction
The column was_vehicle_left_hand_drive is IMPORTANT for Prediction
The column x1st_point_of_impact is IMPORTANT for Prediction
The column month is IMPORTANT for Prediction
The column weekend is IMPORTANT for Prediction
The column hour is IMPORTANT for Prediction
The column time_of_day is IMPORTANT for Prediction
The column season is IMPORTANT for Prediction
The column engine_capacity_cc_size is IMPORTANT for Prediction

Visualizations In Relation to Accident Seriousness

Method:

For my visualizations I have decided to use some of the features with the highest correlations to accident_seriousness:

  • did_police_officer_attend_scene_of_accident
  • x1st_point_of_impact
  • number_of_vehicles
  • speed_limit
  • urban_or_rural_area
  • skidding_and_overturning
  • vehicle_leaving_carriageway
  • sex_of_driver
  • vehicle_type
  • vehicle_manoeuvre
  • engine_capacity_cc
  • number_of_casualties
  • driver_home_area_type
  • age_band_of_driver
  • junction_control
  • hit_object_off_carriageway
  • hit_object_in_carriageway
  • driver_imd_decile *
  • junction_detail *
  • junction_location *
  • propulsion_code *
  • year *

Note: The columns used were selected because of the absolute value of their correlation in relation to accident_seriousness

*columns added after correlation was done after undersampling

For visual reasons, two separate dataframes were created, for not serious and serious accidents. I wanted to better scale the data and for me, this was the simplest way of doing so.

In [9]:
#dataframe where accidents are Slight
not_serious = df[(df['accident_seriousness']=="Not Serious")]
print("Not Serious Group Shape:", not_serious.shape)

not_serious.accident_seriousness.value_counts()
Not Serious Group Shape: (492804, 60)
Out[9]:
Not Serious    492804
Name: accident_seriousness, dtype: int64
In [10]:
#dataframe where accidents are serious
serious= df[(df['accident_seriousness']=="Serious")]

print("Serious Group Shape:", serious.shape)
serious.accident_seriousness.value_counts()
Serious Group Shape: (68331, 60)
Out[10]:
Serious    68331
Name: accident_seriousness, dtype: int64
In [20]:
#map 1, 2, 3 in did_police_officer_attend_scene_of_accident with Yes, No,Self-reported
policeattend = {1: "Yes", 2:"No", 3:"Self-Reported"}
not_serious['did_police_officer_attend_scene_of_accident']=not_serious['did_police_officer_attend_scene_of_accident'].map(policeattend)
df['did_police_officer_attend_scene_of_accident']=df['did_police_officer_attend_scene_of_accident'].map(policeattend)
serious['did_police_officer_attend_scene_of_accident']=serious['did_police_officer_attend_scene_of_accident'].map(policeattend)
In [21]:
imddecile = {1:"Most deprived 10%", 2:"More deprived 10-20%", 3:"More deprived 20-30%", 
             4:"More deprived 30-40%", 5:"More deprived 40-50%", 6:"Less deprived 40-50%", 
             7:"Less deprived 30-40%", 8:"Less deprived 20-30%", 9:"Less deprived 10-20%", 
             10:"Least deprived 10%"}

not_serious['driver_imd_decile']=not_serious['driver_imd_decile'].map(imddecile)
df['driver_imd_decile']=df['driver_imd_decile'].map(imddecile)
serious['driver_imd_decile']=serious['driver_imd_decile'].map(imddecile)
In [22]:
#setups for adding frequencies to visualizations
dftotal= float(len(df))
nstotal= float(len(not_serious))
setotal= float(len(serious))

Did Police Officer Attend Scene Of Accident

In [101]:
#Did Police Officer Attend Scene Of Accident
plt.figure(figsize=(15,10))
ax = sns.countplot("did_police_officer_attend_scene_of_accident", hue="accident_seriousness",  
              palette="PuBu", data=not_serious)
plt.title("Did Police Officer Attend Scene Of Not Serious Accident",
          fontsize=20, fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nAttendance", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.0, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber Attended", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('did_police_officer_attend_scene_of_accident_not_serious.png')
plt.show()


#Did Police Officer Attend Scene Of Accident
plt.figure(figsize=(15,10))
ax = sns.countplot("did_police_officer_attend_scene_of_accident", hue="accident_seriousness",  
              palette="PuBu", data=serious)
plt.title("Did Police Officer Attend Scene Of Serious Accident",
          fontsize=20, fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nAttendance", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.0, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber Attended", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('did_police_officer_attend_scene_of_accident_serious.png')
plt.show()

First Point of Impact Vs Accident Seriousness

In [102]:
# First Point of Impact Vs Accident Seriousness (Not Serious)
fpoa_order =["Front", "Nearside", "Did not impact", "Back", "Offside"]
plt.figure(figsize=(20,10))
ax = sns.countplot("x1st_point_of_impact", hue="accident_seriousness", order=fpoa_order,  
              palette="PuBu", data=not_serious)
plt.title("First Point of Impact in Not Serious Accidents",fontsize=20,fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nPoint of Impact", fontsize=15, fontweight="bold")
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nFirst Point of Impact Count", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('x1st_point_of_impact_not_serious.png')
plt.show()


# First Point of Impact Vs Accident Seriousness
plt.figure(figsize=(20,10))
ax = sns.countplot("x1st_point_of_impact", hue="accident_seriousness",  order=fpoa_order,
              palette="PuBu", data=serious)
plt.title("First Point of Impact in Serious Accidents",fontsize=20,fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nPoint of Impact", fontsize=15, fontweight="bold")
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nFirst Point of Impact Count", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('x1st_point_of_impact_serious.png')
plt.show()

Number of Vehicles

In [103]:
#number of vehicles vs accidentseriousness
nov_order=["1","2", "3", "4+"]
#notserious
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness", hue="number_of_vehicles", hue_order=nov_order,
              palette="GnBu_d", data=not_serious)

plt.style.use('dark_background')
plt.title("Number of Vehicles in Not Serious Accidents",
          fontsize=20, fontweight="bold")

plt.xlabel("\nNumber of Vehicles", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('number_of_vehicles_not_serious.png')
plt.show()



#serious
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness", hue="number_of_vehicles", hue_order=nov_order,
              palette="GnBu_d", data=serious)
plt.style.use('dark_background')
plt.title("Number of Vehicles in Serious Accidents",
          fontsize=20, fontweight="bold")

plt.xlabel("\nNumber of Vehicles", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('number_of_vehicles_serious.png')
plt.show()
Speed Limit vs Accident Seriousness
In [111]:
#notserious
splt_order=[15.0, 20.0,30.0,40.0 ,50.0,60.0, 70.0]
#splt1_order=[20.0,30.0,40.0 ,50.0,60.0, 70.0]
plt.figure(figsize=(20,10))
ax = sns.countplot("speed_limit", hue="accident_seriousness", order=splt_order,
              palette="PuBu", data=not_serious)
plt.title("Speed Limit vs Not Serious Accidents",fontsize=20,fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nSpeed Limits", fontsize=15, fontweight="bold")
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nCount", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.4f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('speed_limit_not_serious.png')
plt.show()

#erious
plt.figure(figsize=(20,10))
ax = sns.countplot("speed_limit", hue="accident_seriousness", 
              palette="PuBu", data=serious)
plt.title("Speed Limit vs Serious Accidents",fontsize=20,fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nSpeed Limits", fontsize=15, fontweight="bold")
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nCount", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('speed_limit_serious.png')
plt.show()

Urban or Rural Area vs Accident Seriousness

In [112]:
#urban_or_rural_area vs accident seriousness
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness",  hue="urban_or_rural_area",
              palette="PuBu", data=not_serious)
plt.title("Urban  or Rural  Area vs Accident Severity",fontsize=20,fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nSeverity", fontsize=15, fontweight="bold")
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nUrban or Rural Area Count", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('urban_or_rural_area_not_serious.png')
plt.show()

#urban_or_rural_area vs accident seriousness
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness",  hue="urban_or_rural_area",
              palette="PuBu", data=serious)
plt.title("Urban  or Rural  Area vs Accident Severity",fontsize=20,fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nSeverity", fontsize=15, fontweight="bold")
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nUrban or Rural Area Count", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('urban_or_rural_area_serious.png')
plt.show()

Skidding and Overturning vs Seriousness

In [116]:
#Not Serious Accident
sao_order=["None", "Skidded", "Skidded and overturned", "Overturned", "Jackknifed", 
           "Jackknifed and overturned"]

plt.figure(figsize=(15,10))
ax = sns.countplot("accident_seriousness", hue="skidding_and_overturning", hue_order=sao_order,
              palette="magma", data=not_serious)
plt.style.use('dark_background')
plt.title("Skidding and Overturning in Not Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Skidding and Overturning", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('skidding_and_overturning_not_serious.png')
plt.show()


#Serious Accident Manuevers
plt.figure(figsize=(15,10))
ax= sns.countplot("accident_seriousness", hue="skidding_and_overturning", hue_order=sao_order,
              palette="magma", data=serious)
plt.style.use('dark_background')
plt.title("Skidding and Overturning in Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Skidding and Overturning", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('skidding_and_overturning_serious.png')
plt.show()

Vehicle Leaving Carriageway vs Seriousness

In [118]:
#Not Serious Accident Manuevers
vlc_order=["Did not leave carriageway", "Straight ahead at junction", "Nearside", 
           "Offside", "Offside on to central reservation", "Nearside and rebounded", 
           "Offside - crossed central reservation", "Offside and rebounded", 
           "Offside on to centrl res + rebounded"]

plt.figure(figsize=(15,10))
ax=sns.countplot("accident_seriousness", hue="vehicle_leaving_carriageway", hue_order=vlc_order,
              palette="plasma", data=not_serious)
plt.style.use('dark_background')
plt.title("Vehicle Leaving Carriageway  in Not Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Vehicle Leaving Carriageway ", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents\n", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_leaving_carriageway_not_serious.png')
plt.show()


#Serious Accident Manuevers
plt.figure(figsize=(15,10))
ax=sns.countplot("accident_seriousness", hue="vehicle_leaving_carriageway", hue_order=vlc_order,
              palette="plasma", data=serious)
plt.style.use('dark_background')
plt.title("Vehicle Leaving Carriageway  in Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Vehicle Leaving Carriageway ", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents\n", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_leaving_carriageway_serious.png')
plt.show()

Sex of Driver vs Seriousness

In [121]:
#sex_of_driver
sod_order=["Female", "Male", "Not known"]
plt.figure(figsize=(15,10))
ax=sns.countplot("accident_seriousness", hue="sex_of_driver", hue_order=sod_order,
              palette="magma", data=not_serious)
plt.style.use('dark_background')
plt.title("Sex of Driver in Not Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nSex of Driver", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('sex_of_driver_not_serious.png')
plt.show()

#sex_of_driver serious
plt.figure(figsize=(15,10))
ax=sns.countplot("accident_seriousness", hue="sex_of_driver", hue_order=sod_order,
              palette="magma", data=serious)
plt.style.use('dark_background')
plt.title("Sex of Driver in Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nSex of Driver", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('sex_of_driver_serious.png')
plt.show()
In [122]:
#sex_of_driver
df['sex_of_driver'].value_counts()/df.shape[0]*100
Out[122]:
Male         62.289645
Female       37.562262
Not known     0.148093
Name: sex_of_driver, dtype: float64

Vehicle Type vs Seriousness

In [126]:
#Not Serious Accident Type
vt_order=['Bus', 'Car', 'Goods Vehicle', 'Motorcycle', 'Other Vehicle']

plt.figure(figsize=(15,10))
ax=sns.countplot("accident_seriousness", hue="vehicle_type", hue_order=vt_order,
                 palette="tab20", data=not_serious)
plt.style.use('dark_background')
plt.title("Vehicle Type in Not Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accidents by Vehicle Type", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_type_not_serious.png')
plt.show()


#Serious Accident Type
plt.figure(figsize=(15,10))
ax=sns.countplot("accident_seriousness", hue="vehicle_type", hue_order=vt_order,
              palette="tab20", data=serious)
plt.style.use('dark_background')
plt.title("Vehicle Type in Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accidents by Vehicle Type", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_type_serious.png')
plt.show()

Vehicle Manoeuvres

In [128]:
#Not Serious Accident Manuevers

vm_order=['Turning right', 'Going ahead other', 'Going ahead right-hand bend',
          'Slowing or stopping', 'Turning left', 'Waiting to go - held up',
          'Waiting to turn right', 'Overtaking static vehicle - offside' ,
          'Parked', 'Overtaking - nearside', 'U-turn', 'Changing lane to right', 
          'Reversing', 'Waiting to turn left', 'Changing lane to left',
          'Going ahead left-hand bend', 'Overtaking moving vehicle - offside', 'Moving off']

plt.figure(figsize=(20,10))
ax=sns.countplot("accident_seriousness", hue="vehicle_manoeuvre", hue_order=vm_order,
              palette="tab20", data=not_serious)
plt.style.use('dark_background')
plt.title("Vehicle Manuevers in Not Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Vehicle Manuevers", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_manoeuvre_not_serious.png')
plt.show()


#Serious Accident Manuevers
plt.figure(figsize=(20,10))
ax=sns.countplot("accident_seriousness", hue="vehicle_manoeuvre",hue_order=vm_order,
              palette="tab20", data=serious)
plt.style.use('dark_background')
plt.title("Vehicle Manuevers in Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Vehicle Manuevers", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold") 
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_manoeuvre_serious.png')
plt.show()

Driver Home Type Area

In [130]:
#driver_home_area_type
dhoa_order=['Urban area', 'Rural', 'Small town']
#Serious Accident Driver Home Type Area
plt.figure(figsize=(20,15))
ax= sns.countplot("accident_seriousness", hue="driver_home_area_type", hue_order=dhoa_order,
              palette="rainbow", data=not_serious)

plt.style.use('dark_background')
plt.title("Accident Driver Home Type Area in Not Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nSeriousness", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
#plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('driver_home_area_type_not_serious.png')
plt.show()


#driver_home_area_type

#Serious Accident Driver Home Type Area
plt.figure(figsize=(20,15))
ax= sns.countplot("accident_seriousness", hue="driver_home_area_type", hue_order=dhoa_order,
              palette="rainbow", data=serious)

plt.style.use('dark_background')
plt.title("Accident Driver Home Type Area in Serious Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nSeriousness", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
#plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('driver_home_area_type_serious.png')
plt.show()

Age Band of Driver

In [131]:
#age_band_of_driver
abod_order=['Under 16', '16-25', '26-45', '46-65','Over 65']
#Not Serious Accident age_band_of_driver
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="age_band_of_driver", hue_order=abod_order,
              palette="magma", data=not_serious)

plt.style.use('dark_background')
plt.title("Not Serious Accident by Age Band of Driver",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Age Band of Driver", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
#plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('age_band_of_driver_not_serious.png')
plt.show()


#Serious Accident age_band_of_driver
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="age_band_of_driver", hue_order=abod_order,
              palette="magma", data=serious)

plt.style.use('dark_background')
plt.title("Serious Accident by Age Band of Driver",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Age Band of Driver", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
#plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('age_band_of_driver_serious.png')
plt.show()

Junction Control

In [133]:
#junction_control
jc_order = ['Give way or uncontrolled', 'Auto traffic signal', 'Authorised person',
            'Stop sign','Not at junction or within 20 metres']
#Not Serious Accident junction_control
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="junction_control", hue_order=jc_order,
              palette="magma", data=not_serious)

plt.style.use('dark_background')
plt.title("Not Serious Accident by Junction Control",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Junction Control", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_control_not_serious.png')
plt.show()

#Serious Accident junction_control
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="junction_control",hue_order=jc_order,
              palette="magma", data=serious)

plt.style.use('dark_background')
plt.title("Serious Accident by Junction Control",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Junction Control", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_control_serious.png')
plt.show()

Hit Object Off Carriageway

In [135]:
#hit_object_off_carriageway
hooffc_order=['None', 'Lamp post', 'Road sign or traffic signal', 'Other permanent object',
              'Entered ditch', 'Tree', 'Near/Offside crash barrier','Central crash barrier',
              'Bus stop or bus shelter', 'Telegraph or electricity pole', 'Submerged in water',
              'Wall or fence']
#Not Serious Accident hit_object_off_carriageway
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="hit_object_off_carriageway", hue_order=hooffc_order,
              palette="plasma", data=not_serious)

plt.style.use('dark_background')
plt.title("Not Serious Accident by Hit Object Off Carriageway",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Hit Object Off Carriageway", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('hit_object_off_carriageway_not_serious.png')
plt.show()

#Serious Accident hit_object_off_carriageway
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="hit_object_off_carriageway", hue_order=hooffc_order,
              palette="plasma", data=serious)
plt.style.use('dark_background')
plt.title("Serious Accident by Hit Object Off Carriageway",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Hit Object Off Carriageway", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('hit_object_off_carriageway_serious.png')
plt.show()

Hit Object In Carriageway

In [222]:
#hit_object_in_carriageway
hoinc_order=['None', 'Kerb', 'Other object', 'Bollard or refuge', 'Parked vehicle',
             'Road works', 'Open door of vehicle', 'Central island of roundabout',
             'Previous accident', 'Bridge (side)', 'Any animal (except ridden horse)',
             'Bridge (roof)']
#Not Serious Accident hit_object_in_carriageway
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="hit_object_in_carriageway", hue_order=hoinc_order,
              palette="plasma", data=not_serious)

plt.style.use('dark_background')
plt.title("Not Serious Accident by Hit Object in Carriageway",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Hit Object in Carriageway", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('hit_object_in_carriageway_not_serious.png')
plt.show()

#Serious Accident hit_object_in_carriageway
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="hit_object_in_carriageway", hue_order=hoinc_order,
              palette="plasma", data=serious)
plt.style.use('dark_background')
plt.title("Serious Accident by Hit Object in Carriageway",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Hit Object in Carriageway", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('hit_object_in_carriageway_serious.png')
plt.show()

Driver IMD Decile

In [221]:
#driver_imd_decile
imd_order=["Least deprived 10%", "Less deprived 10-20%", "Less deprived 20-30%", 
           "Less deprived 30-40%","Less deprived 40-50%","Most deprived 10%",
           "More deprived 10-20%", "More deprived 20-30%", "More deprived 30-40%",
           "More deprived 40-50%"]
#Not Serious Accident driver_imd_decile
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="driver_imd_decile", hue_order=imd_order,
              palette="plasma", data=not_serious)

plt.style.use('dark_background')
plt.title("Not Serious Accident by Driver Area Deprivation Score",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Driver Area Deprivation Score", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('driver_imd_decile_not_serious.png')
plt.show()


#Serious Accident driver_imd_decile
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="driver_imd_decile", hue_order=imd_order,
              palette="plasma", data=serious)

plt.style.use('dark_background')
plt.title("Serious Accident by Driver Area Deprivation Score",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Driver Area Deprivation Score", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('driver_imd_decile_serious.png')
plt.show()

Junction Detail

In [210]:
#junction_detail
jud_order=['T or staggered junction', 'Mini-roundabout', 'Crossroads',
           'Private drive or entrance', 'More than 4 arms (not roundabout)',
           'Roundabout', 'Slip road', 'Other junction','Not at junction or within 20 metres']
#Not Serious Accident junction_detail
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="junction_detail", hue_order=jud_order,
              palette="plasma", data=not_serious)

plt.style.use('dark_background')
plt.title("Not Serious Accident by Junction Detail",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Junction Detail", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_detail_not_serious.png')
plt.show()


#Serious Accident junction_detail
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="junction_detail", hue_order=jud_order,
              palette="plasma", data=serious)

plt.style.use('dark_background')
plt.title("Serious Accident by Junction Detail",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Junction Detail", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_detail_serious.png')
plt.show()

Junction Location

In [211]:
#junction_location
jul_order=['Mid Junction - on roundabout or on main road', 'Entering main road',
           'Approaching junction or waiting/parked at junction approach',
           'Cleared junction or waiting/parked at junction exit', 'Leaving main road',
           'Leaving roundabout', 'Entering roundabout', 'Entering from slip road',
           'Not at or within 20 metres of junction']
#Not Serious Accident junction_location
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="junction_location", hue_order=jul_order,
              palette="plasma", data=not_serious)

plt.style.use('dark_background')
plt.title("Not Serious Accident by Junction Location",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Junction Location", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_location_not_serious.png')
plt.show()


#Serious Accident junction_location
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="junction_location", hue_order=jul_order,
              palette="plasma", data=serious)

plt.style.use('dark_background')
plt.title("Serious Accident by Junction Location",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Junction Location", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_location_serious.png')
plt.show()

Propulsion Code

In [218]:
#propulsion_code
pd_order=['Petrol', 'Heavy oil', 'Hybrid electric', 'Bio-fuel', 'LPG Petrol', 'Diesel',
          'Fuel cells', 'New fuel technology', 'Electric diesel']
pd_order2=['Petrol', 'Heavy oil', 'Hybrid electric', 'Bio-fuel', 'LPG Petrol', 'Electric diesel']
#Not Serious Accident propulsion_code
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="propulsion_code", hue_order=pd_order,
              palette="plasma", data=not_serious)

plt.style.use('dark_background')
plt.title("Not Serious Accident by Propulsion Code",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Propulsion Code", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('propulsion_code_not_serious.png')
plt.show()


#Serious Accident propulsion_code
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="propulsion_code", hue_order=pd_order2,
              palette="plasma", data=serious)

plt.style.use('dark_background')
plt.title("Serious Accident by Propulsion Code",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Propulsion Code", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('propulsion_code_serious.png')
plt.show()

Year

In [230]:
#yeare
year_order=[2010, 2011, 2012, 2013, 2014, 2015, 2016]

#Not Serious Accident yeare
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="year", hue_order=year_order,
              palette="plasma", data=not_serious)

plt.style.use('dark_background')
plt.title("Not Serious Accident by Year",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Year", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('year_not_serious.png')
plt.show()


#Serious Accident year
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="year", hue_order=year_order,
              palette="plasma", data=serious)

plt.style.use('dark_background')
plt.title("Serious Accident by Year",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Year", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15) 
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('year_serious.png')
plt.show()

Visualization Summary

  • did_police_officer_attend_scene_of_accident: Police attended most accidents but were less likely to NOT be called in serious accidents.
  • x1st_point_of_impact: Majority of accidents were front impacted as the first point of impact. Not serious accidents had a higher percentage of Back impact accidents than serious accidents. Serious accidents had higher percentages of Offside and Nearside accidents.
  • number_of_vehicles: Nothing significant.
  • speed_limit: Majority of accidents occurred in 30 speed limit zones. It would have been beneficial to have actual data on the speeds of the vehicles involved or at least if they were speeding.
  • urban_or_rural_area: Rural areas had a higher percentage of serious accidents. This may relate to hospital locations or emergency vehicle arrival data which was not available.
  • skidding_and_overturning: Higher percentages of serious accidents involved skidding, jackknifing or overturning.
  • vehicle_leaving_carriageway: Most vehicles did not leave the carriageway in either type of accident, however serious accidents had higher percentages of those that did leave the carriageway.
  • sex_of_driver: Men were more involved in both serious and not serious accidents, however according to racfoundation.org, there are only 355 of female privately registered cars on UK roads.
  • vehicle_type: Motorcycles were involved in a significantly higher percentage of serious accidents than not serious accidents
  • vehicle_manoeuvre: Nothing significant.
  • driver_home_area_type: Rural and Small Towns has higher percentages of serious accidents. This may relate to hospital locations or emergency vehicle arrival data which was not available.
  • age_band_of_driver: The age bands over the age of 25 had a higher percentage of serious accidents than not serious.
  • junction_control: Most areas with accidents were uncontrolled.
  • hit_object_off_carriageway: The majority of accidents did not involve objects being hit off the carriageway, however serious accidents had higher percentages of accidents that did involve hitting an object off the carriageway.
  • hit_object_in_carriageway: Most accidents did not involve objects being hit in the carriageway; however serious accidents had higher percentages of accidents that did involve hitting an object off the carriageway.
  • driver_imd_decile: Nothing significant. Most accidents occurred in areas that were Less deprived 20-30%
  • junction_detail: T or staggered junctions were where most of the accidents occurred.
  • junction_location: Nothing that separates the two serious types. However, most accidents seem to have occurred in Mid Junction - on roundabout or on main road or situations where the driver was approaching junction or waiting/parked at junction approach.
  • propulsion_code: Diesel, Fuel cells, New fuel technology, vehicles were not recorded as a part of serious accidents.
  • year: There has been a spike in percentage of serious accidents over the years. However, the percentage of not serious accidents has remained somewhat consistent

Other Visualizations

Due to the previous visualizations a comparison of certain variables was desired to see more correlations.

Junction Control by Junction Detail

In [246]:
#Not Serious Accident
plt.figure(figsize=(20,15))
ax=sns.countplot("junction_control", hue="junction_detail",
              palette="plasma", data=df)

plt.style.use('dark_background')
plt.title("Junction Control by Junction Detail",fontsize=25,fontweight="bold")
plt.xlabel("\nAccident by Year", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('junction_control_by_junction_detail.png')
plt.show()

Junction Control by Junction Location

In [245]:
plt.figure(figsize=(20,15))
ax=sns.countplot("junction_control", hue="junction_location",
              palette="plasma", data=df)

plt.style.use('dark_background')
plt.title("Junction Control by Junction Location in Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nAccident by Year", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('junction_control_by_junction_location.png')
plt.show()

First point of Impact by Junction Detail

In [248]:
plt.figure(figsize=(20,15))
ax=sns.countplot("x1st_point_of_impact", hue="junction_detail",
              palette="plasma", data=df)

plt.style.use('dark_background')
plt.title("First point of Impact by Junction Detail",fontsize=25,fontweight="bold")
plt.xlabel("\nAccident by Year", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('x1st_point_of_impact_by_junction_detail.png')
plt.show()

First point of Impact by Junction Location

In [247]:
plt.figure(figsize=(20,15))
ax=sns.countplot("x1st_point_of_impact", hue="junction_location",
              palette="plasma", data=df)

plt.style.use('dark_background')
plt.title("First point of Impact by Junction Location",fontsize=25,fontweight="bold")
plt.xlabel("\nAccident by Year", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('x1st_point_of_impact_by_junction_location.png')
plt.show()

Junction Control and First Point of Impact

In [249]:
plt.figure(figsize=(20,15))
ax=sns.countplot("x1st_point_of_impact", hue="junction_control",
              palette="plasma", data=df)

plt.style.use('dark_background')
plt.title("First point of Impact by Junction Control",fontsize=25,fontweight="bold")
plt.xlabel("\nAccident by Year", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('x1st_point_of_impact_by_junction_control.png')
plt.show()

Other Visualizations Summary

No matter the situation above, the most accidents were involving areas that were uncontrolled. One of the main ones were the junction Detail T or staggered junction.

Other areas of concern include Mid Junctions on roundabouts or main roads and areas approaching a junction were cars were either parking or waiting in the junction.

Solution

From the data above more controlled areas would be benefical. Maybe signs alerting drivers of the upcoming junctions, traffic lights, or stop signs would help in some of these areas where they are feasible.

staggered-junctions.jpg

For example, this is a staggered junction, the main junction detail in accidents. One can understand how a situation such as these can lead to numerous accidents especially if proper signage is not available. Perhaps traffic lights, stop signs, or warnings indicating that they are approaching certain junctions would help reduce accidents.

Web Scraping

Below you wll find a web scrape of the website, Learner Driving Centres, which contains information on road signs in the UK. They were pulled to show examples of signage available to be placed.

In [11]:
#request website
r = requests.get('https://www.learnerdriving.com/learn-to-drive/highway-code/road-signs')

#parse HTML
soup = BeautifulSoup(r.text, 'html.parser')

#filter results
results = soup.find_all('div', attrs={'class':'fifth'})
In [12]:
#done to find specific results area
first_result=results[0]
first_result
first_result.find('img')['src']
Out[12]:
'/images/highway-code/entry-to-20-mph-zone.png'
In [13]:
#get images of signs and sign descriptions 
signage = []
for result in results:
    sign=result.find('img')['src']
    sign_desc=result.contents[1]
    signage.append((sign, sign_desc))
In [14]:
#put pulled UK Traffic Signs into dataframe
uktrafficsigns = pd.DataFrame(signage, columns=['Sign', 'Sign Description'])
uktrafficsigns.head()
Out[14]:
Sign Sign Description
0 /images/highway-code/entry-to-20-mph-zone.png Entry to 20 mph zone
1 /images/highway-code/end-of-20-mph-zone.png End of 20 mph zone
2 /images/highway-code/maximum-speed.png Maximum speed
3 /images/highway-code/national-speed-limit-appl... National speed limit applies
4 /images/highway-code/school-crossing-patrol.png School crossing patrol
In [15]:
'''
the "image" is just part of the image link, 
must parse the first half in order to have full image link

'''
uktrafficsigns['Sign'] = 'https://www.learnerdriving.com/'+uktrafficsigns['Sign'] 
uktrafficsigns.head()
Out[15]:
Sign Sign Description
0 https://www.learnerdriving.com//images/highway... Entry to 20 mph zone
1 https://www.learnerdriving.com//images/highway... End of 20 mph zone
2 https://www.learnerdriving.com//images/highway... Maximum speed
3 https://www.learnerdriving.com//images/highway... National speed limit applies
4 https://www.learnerdriving.com//images/highway... School crossing patrol
In [16]:
'''
In some coding below I saw that one of the fields was blank (at index 42) but was not reading as null.
In order to fix that I changed the "Sign Description" and decided to place it here.
'''
uktrafficsigns.at[42,'Sign Description']="T-junction with priority over vehicles from the right"
In [17]:
#I wanted to save this as a csv for later, and to stop unnecessary web scraping
uktrafficsigns.to_csv('uktrafficsigns.csv', header=False, index=False) 
In [18]:
#I wanted the html to show up as images instead of links
def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'

pd.set_option('display.max_colwidth', -1)
ukts=HTML(uktrafficsigns.to_html(escape=False ,formatters=dict(Sign=path_to_image_html)))
HTML(uktrafficsigns.to_html(escape=False ,formatters=dict(Sign=path_to_image_html)))
Out[18]:
Sign Sign Description
0 Entry to 20 mph zone
1 End of 20 mph zone
2 Maximum speed
3 National speed limit applies
4 School crossing patrol
5 Stop and give way
6 Give way to traffic on major road
7 Manually operated temporary
8 STOP and GO signs
9 No entry for vehicular traffic
10 No vehicles except bicycles being pushed
11 No cycling
12 No motor vehicles
13 No buses (over 8 passenger seats)
14 No overtaking
15 No towed caravans
16 No vehicles carrying explosives
17 No vehicle or combination of vehicles over length shown
18 No vehicles over height shown
19 No vehicles over width shown
20 Give priority to vehicles from opposite direction
21 No right turn
22 No left turn
23 No U-turns
24 No goods vehicles over maximum gross weight shown (in tonnes) except for loading and unloading
25 Ahead only
26 Turn left ahead (right if symbol reversed)
27 Turn left (right if symbol reversed)
28 Keep left (right if symbol reversed)
29 Vehicles may pass either side to reach same destination
30 Mini-roundabout (roundabout circulation - give way to vehicles from the immediate right)
31 Route to be used by pedal cycles only
32 Segregated pedal cycle and pedestrian route
33 Minimum speed
34 End of minimum speed
35 Distance to 'STOP' line ahead
36 Dual carriage-way ends
37 Road narrows on right (left if symbol reversed)
38 Road narrows on both sides
39 Distance to 'Give Way' line ahead
40 Crossroads
41 Junction on bend ahead
42 T-junction with priority over vehicles from the right
43 Staggered junction
44 Traffic merging from left ahead
45 Double bend first to left (symbol may be reversed)
46 Bend to right (or left if symbol reversed)
47 Roundabout
48 Uneven road
49 Plate below some signs
50 Two-way traffic crosses one-way road
51 Two-way traffic straight ahead
52 Opening or swing bridge ahead
53 Low-flying aircraft or sudden aircraft noise
54 Falling or fallen rocks
55 Traffic signals not in use
56 Traffic signals
57 Slippery road
58 Steep hill downwards
59 Steep hill upwards
60 Tunnel ahead
61 Trams crossing ahead
62 Level crossing with barrier or gate ahead
63 Level crossing without barrier or gate ahead
64 Level crossing without barrier
65 School crossing patrol ahead (some signs have amber lights which flash when children are crossing)
66 Frail (or blind or disabled if shown) pedestrians likely to cross road ahead
67 Pedestrians in road ahead
68 Zebra crossing
69 Overhead electric cable; plate indicates maximum height of vehicles which can pass safely
70 Cattle
71 Wild animals
72 Wild horses or ponies
73 Accompanied horses or ponies
74 Cycle route ahead
75 Risk of ice
76 Traffic queues likely ahead
77 Distance over which road humps extend
78 Other danger; plate indicates nature of danger
79 Soft verges
80 Side winds
81 Hump bridge
82 Worded warning sign
83 Quayside or river bank
84 Risk of grounding
In [19]:
'''
Here I am creating a df that will allow me to pull all junction signs.
"ction" was used instead of "junction" in order to pull all variables.
'''
junction =uktrafficsigns[uktrafficsigns['Sign Description'].str.contains("nction", regex=False)]

#Making it its own HTML object (same as above)

def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'

pd.set_option('display.max_colwidth', -1)

HTML(junction.to_html(escape=False ,formatters=dict(Sign=path_to_image_html)))
Out[19]:
Sign Sign Description
41 Junction on bend ahead
42 T-junction with priority over vehicles from the right
43 Staggered junction
In [20]:
#Repeated the above steps for giveways
give=uktrafficsigns[uktrafficsigns['Sign Description'].str.contains("ive ", regex=False)]
def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'

pd.set_option('display.max_colwidth', -1)

HTML(give.to_html(escape=False ,formatters=dict(Sign=path_to_image_html)))
Out[20]:
Sign Sign Description
5 Stop and give way
6 Give way to traffic on major road
20 Give priority to vehicles from opposite direction
30 Mini-roundabout (roundabout circulation - give way to vehicles from the immediate right)
39 Distance to 'Give Way' line ahead
In [21]:
#roundabouts
roundabout=uktrafficsigns[uktrafficsigns['Sign Description'].str.contains("ounda", regex=False)]

def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'

pd.set_option('display.max_colwidth', -1)

HTML(roundabout.to_html(escape=False ,formatters=dict(Sign=path_to_image_html)))
Out[21]:
Sign Sign Description
30 Mini-roundabout (roundabout circulation - give way to vehicles from the immediate right)
47 Roundabout

Mapping of Problem Areas

Below we used Tableau to map what could be deemed problem areas for the UK. These are accidents in areas with high deprivation (driver_imd_decile @ more deprived 40-50%) and no signange at T or staggered junctions.

In [21]:
%%HTML

<div class='tableauPlaceholder' id='viz1572177057382' style='position: relative'><noscript><a href='https:&#47;&#47;github.com&#47;GenTaylor&#47;Traffic-Accident-Analysis'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ac&#47;AccidentForecasting&#47;SeriousAccidentsinAreaswithHighDeprivationandNoSignage&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='AccidentForecasting&#47;SeriousAccidentsinAreaswithHighDeprivationandNoSignage' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ac&#47;AccidentForecasting&#47;SeriousAccidentsinAreaswithHighDeprivationandNoSignage&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1572177057382');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

Machine Learning

In [3]:
#made separate dataframe w. set index that wouldnt effect data vis above
df1=df
#set index to accident_index
df1.set_index('accident_index', inplace=True)
df1.head()
Out[3]:
1st_road_class 1st_road_number 2nd_road_number accident_severity carriageway_hazards date day_of_week did_police_officer_attend_scene_of_accident junction_control junction_detail ... vehicle_type was_vehicle_left_hand_drive x1st_point_of_impact month weekend hour time_of_day season engine_capacity_cc_size accident_seriousness
accident_index
201001BS70003 B 302 0 Slight None 2010-01-11 Monday 1 Give way or uncontrolled T or staggered junction ... Goods Vehicle No Front 1 0 7 1 winter small engine cc Not Serious
201001BS70004 A 402 4204 Slight None 2010-01-11 Monday 1 Auto traffic signal T or staggered junction ... Car No Front 1 0 18 6 winter medium engine cc Not Serious
201001BS70007 Unclassified 0 0 Slight None 2010-01-02 Saturday 1 Give way or uncontrolled Mini-roundabout ... Car No Nearside 1 1 21 6 winter medium engine cc Not Serious
201001BS70007 Unclassified 0 0 Slight None 2010-01-02 Saturday 1 Give way or uncontrolled Mini-roundabout ... Car No Front 1 1 21 6 winter small engine cc Not Serious
201001BS70008 A 3217 3220 Slight None 2010-01-04 Monday 1 Auto traffic signal Crossroads ... Car No Nearside 1 0 20 6 winter medium engine cc Not Serious

5 rows Ă— 59 columns

In [4]:
df1 = df1.drop(['accident_severity'],axis=1)
In [5]:
df1.head()
Out[5]:
1st_road_class 1st_road_number 2nd_road_number carriageway_hazards date day_of_week did_police_officer_attend_scene_of_accident junction_control junction_detail latitude ... vehicle_type was_vehicle_left_hand_drive x1st_point_of_impact month weekend hour time_of_day season engine_capacity_cc_size accident_seriousness
accident_index
201001BS70003 B 302 0 None 2010-01-11 Monday 1 Give way or uncontrolled T or staggered junction 51.484087 ... Goods Vehicle No Front 1 0 7 1 winter small engine cc Not Serious
201001BS70004 A 402 4204 None 2010-01-11 Monday 1 Auto traffic signal T or staggered junction 51.509212 ... Car No Front 1 0 18 6 winter medium engine cc Not Serious
201001BS70007 Unclassified 0 0 None 2010-01-02 Saturday 1 Give way or uncontrolled Mini-roundabout 51.513314 ... Car No Nearside 1 1 21 6 winter medium engine cc Not Serious
201001BS70007 Unclassified 0 0 None 2010-01-02 Saturday 1 Give way or uncontrolled Mini-roundabout 51.513314 ... Car No Front 1 1 21 6 winter small engine cc Not Serious
201001BS70008 A 3217 3220 None 2010-01-04 Monday 1 Auto traffic signal Crossroads 51.484361 ... Car No Nearside 1 0 20 6 winter medium engine cc Not Serious

5 rows Ă— 58 columns

In [6]:
print(df1.columns)
Index(['1st_road_class', '1st_road_number', '2nd_road_number',
       'carriageway_hazards', 'date', 'day_of_week',
       'did_police_officer_attend_scene_of_accident', 'junction_control',
       'junction_detail', 'latitude', 'light_conditions',
       'local_authority_district', 'local_authority_highway', 'longitude',
       'lsoa_of_accident_location', 'number_of_casualties',
       'number_of_vehicles', 'pedestrian_crossing-human_control',
       'pedestrian_crossing-physical_facilities', 'police_force',
       'road_surface_conditions', 'road_type', 'special_conditions_at_site',
       'speed_limit', 'time', 'urban_or_rural_area', 'weather_conditions',
       'year', 'inscotland', 'age_band_of_driver', 'age_of_vehicle',
       'driver_home_area_type', 'driver_imd_decile', 'engine_capacity_cc',
       'hit_object_in_carriageway', 'hit_object_off_carriageway',
       'journey_purpose_of_driver', 'junction_location', 'make', 'model',
       'propulsion_code', 'sex_of_driver', 'skidding_and_overturning',
       'towing_and_articulation', 'vehicle_leaving_carriageway',
       'vehicle_locationrestricted_lane', 'vehicle_manoeuvre',
       'vehicle_reference', 'vehicle_type', 'was_vehicle_left_hand_drive',
       'x1st_point_of_impact', 'month', 'weekend', 'hour', 'time_of_day',
       'season', 'engine_capacity_cc_size', 'accident_seriousness'],
      dtype='object')

Preprocessing

In [7]:
#separate dtypes
notif=df1.select_dtypes(exclude=['int','float','int64'])
intfldtypes = df1.select_dtypes(include=['int','float','int64'])
print('Objects',notif.columns)
print("\nNonObjects",intfldtypes.columns)

#checking to make sure all are accounted for
print(df1.shape)
print(notif.shape)
print(intfldtypes.shape)
Objects Index(['1st_road_class', '1st_road_number', '2nd_road_number',
       'carriageway_hazards', 'date', 'day_of_week',
       'did_police_officer_attend_scene_of_accident', 'junction_control',
       'junction_detail', 'light_conditions', 'local_authority_district',
       'local_authority_highway', 'lsoa_of_accident_location',
       'number_of_casualties', 'number_of_vehicles',
       'pedestrian_crossing-human_control',
       'pedestrian_crossing-physical_facilities', 'police_force',
       'road_surface_conditions', 'road_type', 'special_conditions_at_site',
       'time', 'urban_or_rural_area', 'weather_conditions', 'inscotland',
       'age_band_of_driver', 'age_of_vehicle', 'driver_home_area_type',
       'hit_object_in_carriageway', 'hit_object_off_carriageway',
       'journey_purpose_of_driver', 'junction_location', 'make', 'model',
       'propulsion_code', 'sex_of_driver', 'skidding_and_overturning',
       'towing_and_articulation', 'vehicle_leaving_carriageway',
       'vehicle_locationrestricted_lane', 'vehicle_manoeuvre',
       'vehicle_reference', 'vehicle_type', 'was_vehicle_left_hand_drive',
       'x1st_point_of_impact', 'weekend', 'hour', 'time_of_day', 'season',
       'engine_capacity_cc_size', 'accident_seriousness'],
      dtype='object')

NonObjects Index(['latitude', 'longitude', 'speed_limit', 'year', 'driver_imd_decile',
       'engine_capacity_cc', 'month'],
      dtype='object')
(561135, 58)
(561135, 51)
(561135, 7)

Label Encoder was used instead of OneHotEncoder due to the memory errors One Hot Encoder caused in the data. The algorithms used will be classifiers, through boosting and trees, and not linear.

In [8]:
#label encode objects
obj_le= notif.apply(LabelEncoder().fit_transform)
#re-add with non-objects
df_ml= pd.concat([obj_le,intfldtypes], axis=1, sort=False)
#check shape
print(df_ml.shape)
(561135, 58)
In [9]:
#Set up of X and Y
X= df_ml.drop(['accident_seriousness'],axis=1)
y= df_ml['accident_seriousness']
In [10]:
df_ml.accident_seriousness.value_counts()
Out[10]:
0    492804
1     68331
Name: accident_seriousness, dtype: int64

Imbalanced Data

The data in this dataset is extremely imbalanced for what we are trying to predict. We are going to resample the data as undersampling, where we reduce the number of majority (Not Serious Accidents) samples.


The machine learning classifier algorithms that we are going to use are as follows:

  • Bagging Classifier (sklearn)
  • AdaBoost Classifier (sklearn)
  • Random Forest Classifier (sklearn)
  • Gradient Boosting Classifier (sklearn)*
  • LightGBM Classifier (LightGBM)
  • XGBoost Classifier (xgboost)
  • Balanced Bagging Classifier(imblearn)
  • Balanced Random Forest Classifier (imblearn)
  • Easy Ensemble Classifier (imblearn)


*Gradient Boosting was commented out because of the time it took to run (18hrs) and not having relevant enough results to still consider.

Resample: Undersampling

In [11]:
# setting up testing and training sets
res_X_train, res_X_test, res_y_train, res_y_test = train_test_split(X, y, 
                                                                    test_size=0.25, random_state=27)
In [12]:
# concatenate our training data back together
res_X = pd.concat([res_X_train, res_y_train], axis=1)
In [13]:
# separate minority and majority classes
not_severe = res_X[res_X.accident_seriousness==0]
severe = res_X[res_X.accident_seriousness==1]
In [14]:
# decrease majority
not_severe_decreased = resample(not_severe,
                          replace=True, # sample with replacement
                          n_samples=len(severe), # match number in majority class
                          random_state=27) # reproducible results
In [15]:
# combine majority and severe_increased minority
newdf = pd.concat([severe, not_severe_decreased])
In [16]:
newdf.accident_seriousness.value_counts()
Out[16]:
1    51357
0    51357
Name: accident_seriousness, dtype: int64
In [17]:
res_X_train = newdf.drop('accident_seriousness', axis=1)
res_y_train = newdf.accident_seriousness

Unsupervised Learning

Before, we get in to predictions, we are going to complete some machine learning in ordered to see how the data relates to each other. We are going to do this on the resampled data as well, in order to avoid bias. We will use two clusters which, in theory, represent the two variables for accident_seriousness, Not Serious & Serious

In [39]:
# "clustering" using kmode algorithm that is designed to handle mixed data
km_huang = KModes(n_clusters=2, init = "Huang", n_init = 1)
fitClusters_huang = km_huang.fit_predict(newdf)
fitClusters_huang
Out[39]:
array([1, 0, 1, ..., 1, 1, 1], dtype=uint16)
In [40]:
newdf1 = newdf.copy().reset_index()
clustersDf = pd.DataFrame(fitClusters_huang)
clustersDf.columns = ['cluster_predicted']
combinedDf = pd.concat([newdf1, clustersDf], axis = 1).reset_index()
combinedDf = combinedDf.drop(['index'], axis = 1)
In [41]:
combinedDf.head()
Out[41]:
accident_index 1st_road_class 1st_road_number 2nd_road_number carriageway_hazards date day_of_week did_police_officer_attend_scene_of_accident junction_control junction_detail ... longitude speed_limit year driver_imd_decile engine_capacity_cc month weekend hour accident_seriousness cluster_predicted
0 201554A415715 0 429 0 1 2148 4 0 2 8 ... -2.110741 30.0 2015 8.0 1299.0 11 0 6 1 1
1 2010440174154 0 27 2391 1 119 0 0 2 6 ... -1.319297 30.0 2010 5.0 1997.0 4 1 0 1 0
2 201506N097860 5 0 0 1 1880 5 0 2 8 ... -2.295013 30.0 2015 8.0 2143.0 2 0 12 1 1
3 2016460101917 0 252 251 1 2426 5 0 2 6 ... 0.878850 40.0 2016 8.0 1560.0 8 0 18 1 0
4 201342I085803 0 120 0 1 1182 4 0 2 8 ... 0.674514 60.0 2013 9.0 1149.0 3 0 19 1 0

5 rows Ă— 60 columns

In [44]:
#plotting a few of these features just to see how they relate to the clustering for seriousness
f, axs = plt.subplots(1,3,figsize = (15,8))
sns.countplot(x=combinedDf['did_police_officer_attend_scene_of_accident'],
              order=combinedDf['did_police_officer_attend_scene_of_accident'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])

sns.countplot(x=combinedDf['x1st_point_of_impact'],
              order=combinedDf['x1st_point_of_impact'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])

sns.countplot(x=combinedDf['number_of_vehicles'],
              order=combinedDf['number_of_vehicles'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot1.png')
plt.show()


f, axs = plt.subplots(1,3,figsize = (15,8))

sns.countplot(x=combinedDf['speed_limit'],
              order=combinedDf['speed_limit'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['urban_or_rural_area'],
              order=combinedDf['urban_or_rural_area'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['skidding_and_overturning'],
              order=combinedDf['skidding_and_overturning'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot2.png')
plt.show()



f, axs = plt.subplots(1,3,figsize = (15,8))

sns.countplot(x=combinedDf['vehicle_leaving_carriageway'],
              order=combinedDf['vehicle_leaving_carriageway'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['sex_of_driver'],
              order=combinedDf['sex_of_driver'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['vehicle_type'],
              order=combinedDf['vehicle_type'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot3.png')
plt.show()


f, axs = plt.subplots(1,3,figsize = (15,8))

sns.countplot(x=combinedDf['junction_control'],
              order=combinedDf['junction_control'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['number_of_casualties'],
              order=combinedDf['number_of_casualties'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['age_band_of_driver'],
              order=combinedDf['age_band_of_driver'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot4.png')
plt.show()



f, axs = plt.subplots(1,3,figsize = (15,8))

sns.countplot(x=combinedDf['junction_detail'],
              order=combinedDf['junction_detail'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['junction_location'],
              order=combinedDf['junction_location'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['driver_imd_decile'],
              order=combinedDf['driver_imd_decile'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot5.png')
plt.show()


f, axs = plt.subplots(1,3,figsize = (15,8))

sns.countplot(x=combinedDf['junction_detail'],
              order=combinedDf['junction_detail'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['junction_location'],
              order=combinedDf['junction_location'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['driver_imd_decile'],
              order=combinedDf['driver_imd_decile'].value_counts().index,
              hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot6.png')
plt.show()

Looking at these graphs we can see the patterns of how each category of eacch column pairs off with the clustering on accident_seriousness.

Machine Learning with Resampling as Undersampling

Bagging Classifier with Resampling

In [30]:
#start timming
start_bagc_res = time.time()

#Resampled Bagging Classifier
bagc_res = BaggingClassifier(max_features=X.shape[1], n_estimators=500, random_state=42)

bagc_res.fit(res_X_train, res_y_train)
pred_bagc_res = bagc_res.predict(res_X_test)


#Check Scores

print("Resampled Bagging Classifier Accuracy Score: {:0.2f}%".format(accuracy_score(res_y_test,
                                                                               pred_bagc_res )*100))
print("Resampled Bagging Classifier F1 Score: {:0.2f}%".format(f1_score(res_y_test,
                                                                   pred_bagc_res,average="macro")*100))
print("Resampled Bagging Classifier Precision Score: {:0.2f}%".format(precision_score(res_y_test,
                                                                                 pred_bagc_res, 
                                                                                 average="macro")*100))
print("Resampled Bagging Classifier Recall Score: {:0.2f}%".format(recall_score(res_y_test, 
                                                                           pred_bagc_res,
                                                                           average="macro")*100))
print("Resampled Bagging Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(bagc_res, res_X_train, res_y_train, cv=5)*100)))
print('\n')

# Creates a confusion matrix
bagc_res_cm = confusion_matrix(res_y_test,pred_bagc_res)

# Transform to df for easier plotting
bagc_res_cm_df = pd.DataFrame(bagc_res_cm,
                     index = ['Not Serious','Serious'], 
                     columns = ['Not Serious','Serious'])

plt.figure(figsize=(15,5))

sns.heatmap(bagc_res_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Bagging Classifier Accuracy: {0:.2f}%'
          .format(accuracy_score(res_y_test,pred_bagc_res )*100),fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()

#end time
end_bagc_res = time.time()
print("Resampled Bagging Classifier Time:", end_bagc_res - start_bagc_res)
Resampled Bagging Classifier Accuracy Score: 66.97%
Resampled Bagging Classifier F1 Score: 55.81%
Resampled Bagging Classifier Precision Score: 58.10%
Resampled Bagging Classifier Recall Score: 67.88%
Resampled Bagging Classifier Cross Validation Score: 69.11%


Resampled Bagging Classifier Time: 5531.351397275925
In [31]:
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_bagc_res).ravel()

accuracy = accuracy_score(res_y_test,pred_bagc_res )*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy

print("Resampled Bagging Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Bagging Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Bagging Classifier  Error Rate Score: {0:.2f}%".format(ers))
print("Resampled Bagging Classifier Roc Auc Score: {0:.2f}%"
      .format(roc_auc_score(res_y_test,pred_bagc_res)*100))
Resampled Bagging Classifier Specificity Score: 66.68%
Resampled Bagging Classifier False Positive Rate Score: 33.32%
Resampled Bagging Classifier  Error Rate Score: 33.03%
Resampled Bagging Classifier Roc Auc Score: 67.88%

AdaBoost Classifier with Resampling

AdaBoost is a boosting algorithm and is widely used to process imbalanced data. It uses a single-layer decision tree as a weak classifier. In each training iteration, the weight of the misclassified samples generated by the previous iteration is increased, and the weight of the correctly classified samples is reduced, improving the significance of the misclassified samples in the next iteration. Although the AdaBoost algorithm can be directly used to process imbalanced data, the algorithm focuses more on the misclassified samples than samples of minority class. In addition, it may generate many redundant or useless weak classifiers, increasing the processing overhead and causing performance reduction.

With this being said, we will use AdaBoost on the resampled set and below for the class_weight sets we will use it regularly to see how it handles the imbalanced data on it's own vs resampling.

See: Improved PSO_AdaBoost Ensemble Algorithm for Imbalanced Data

In [32]:
#start
start_res_adbc = time.time()


#Resampled AdaBoost Classifier 
res_adbc = AdaBoostClassifier( n_estimators=500, learning_rate=0.05, random_state=42)
res_adbc.fit(res_X_train, res_y_train)
pred_res_adbc = res_adbc.predict(res_X_test)

#Check scores

print("Resampled AdaBoost Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(res_adbc, res_X_train, res_y_train, cv=3)*100)))
print('\n')

# Creates a confusion matrix
res_adbc_cm = confusion_matrix(res_y_test,pred_res_adbc)

# Transform to dataframe for easier plotting
res_adbc_cm_df = pd.DataFrame(res_adbc_cm,
                     index = ['Not Serious','Serious'], 
                     columns = ['Not Serious','Serious'])

plt.figure(figsize=(15,5))

sns.heatmap(res_adbc_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled AdaBoost Classifier Accuracy: {0:.2f}%'
          .format(accuracy_score(res_y_test,pred_res_adbc )*100),fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()

#end time
end_res_adbc = time.time()
print("Resampled AdaBoost Classifier Time:", end_res_adbc - start_res_adbc)
Resampled AdaBoost Classifier Cross Validation Score: 65.73%


Resampled AdaBoost Classifier Time: 389.8358860015869
In [33]:
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_adbc).ravel()

accuracy = accuracy_score(res_y_test,pred_res_adbc)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy

print("Resampled AdaBoost Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled AdaBoost Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled AdaBoost Classifier Error Rate Score: {0:.2f}%".format(ers))
print("Resampled AdaBoost Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(res_y_test,pred_res_adbc )*100))
print("Resampled AdaBoost Classifier F1 Score: {:0.2f}%"
      .format(f1_score(res_y_test, pred_res_adbc,average="macro")*100))
print("Resampled AdaBoost Classifier Precision Score: {:0.2f}%"
      .format(precision_score(res_y_test, pred_res_adbc, average="macro")*100))
print("Resampled AdaBoost Classifier Recall Score: {:0.2f}%"
      .format(recall_score(res_y_test, pred_res_adbc, average="macro")*100))
print("Resampled AdaBoost Classifier Roc Auc Score: {0:.2f}%"
      .format(roc_auc_score(res_y_test,pred_res_adbc)*100))
Resampled AdaBoost Classifier Specificity Score: 67.12%
Resampled AdaBoost Classifier False Positive Rate Score: 32.88%
Resampled AdaBoost Classifier Error Rate Score: 33.26%
Resampled AdaBoost Classifier Accuracy Score: 66.74%
Resampled AdaBoost Classifier F1 Score: 54.90%
Resampled AdaBoost Classifier Precision Score: 57.14%
Resampled AdaBoost Classifier Recall Score: 65.58%
Resampled AdaBoost Classifier Roc Auc Score: 65.58%

Random Forest Classifier with Resampling

In [34]:
#start
start_res_rfc = time.time()

#random forest
res_rfc = RandomForestClassifier(criterion='entropy', max_depth=40,
                                 max_features=X.shape[1], min_samples_split=8,
                                 n_estimators=500, random_state=42)
res_rfc.fit(res_X_train, res_y_train)
pred_res_rfc = res_rfc.predict(res_X_test)


#cv

print("Resampled Random Forest Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(res_rfc, res_X_train, res_y_train, cv=3)*100)))
print('\n')

# Creates a confusion matrix
res_rfc_cm = confusion_matrix(res_y_test,pred_res_rfc)

# Transform to df for easier plotting
res_rfc_cm_df = pd.DataFrame(res_rfc_cm,
                     index = ['Not Serious','Serious'], 
                     columns = ['Not Serious','Serious'])

plt.figure(figsize=(15,5))

sns.heatmap(res_rfc_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Random Forest Accuracy: {0:.2f}%'.format(accuracy_score(res_y_test,
                                                                            pred_res_rfc)*100),
          fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
end_res_rfc = time.time()
print("\nResampled Random Forest Time: ", end_res_rfc - start_res_rfc)
Resampled Random Forest Classifier Cross Validation Score: 68.85%


Resampled Random Forest Time:  4370.322077035904
In [35]:
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_rfc).ravel()

accuracy = accuracy_score(res_y_test,pred_res_rfc)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy

print("Resampled Random Forest Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Random Forest Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Random Forest Classifier Error Rate Score: {0:.2f}%".format(ers))
print("Resampled Random Forest Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(res_y_test,pred_res_rfc )*100))
print("Resampled Random Forest Classifier F1 Score: {:0.2f}%"
      .format(f1_score(res_y_test, pred_res_rfc,average="macro")*100))
print("Resampled Random Forest Classifier Precision Score: {:0.2f}%"
      .format(precision_score(res_y_test, pred_res_rfc, average="macro")*100))
print("Resampled Random Forest Classifier Recall Score: {:0.2f}%"
      .format(recall_score(res_y_test, pred_res_rfc, average="macro")*100))
print("Resampled Random Forest Classifier Roc Auc Score: {0:.2f}%"
      .format(roc_auc_score(res_y_test, pred_res_rfc)*100))
Resampled Random Forest Classifier Specificity Score: 66.84%
Resampled Random Forest Classifier False Positive Rate Score: 33.16%
Resampled Random Forest Classifier Error Rate Score: 32.91%
Resampled Random Forest Classifier Accuracy Score: 67.09%
Resampled Random Forest Classifier F1 Score: 55.87%
Resampled Random Forest Classifier Precision Score: 58.10%
Resampled Random Forest Classifier Recall Score: 67.85%
Resampled Random Forest Classifier Roc Auc Score: 67.85%

Gradient Boosting Classifier with Resampling

In [36]:
# #Resampled Gradient Boosting Classifier was taken out of the running due to its run time of almost a day
# start_res_gbc = time.time()
# res_gbc = ensemble.GradientBoostingClassifier(learning_rate=0.05, max_depth=40,
#                                               min_samples_leaf=1, n_estimators=500,
#                                               random_state = 42)
# res_gbc.fit(res_X_train, res_y_train)
# pred_res_gbc = res_gbc.predict(res_X_test)

# #Check accuracy
# print("Resampled Gradient Boosting Classifier Accuracy Score: {:0.2f}%"
#       .format(accuracy_score(res_y_test,pred_res_gbc )*100))
# print("Resampled Gradient Boosting Classifier F1 Score: {:0.2f}%"
#       .format(f1_score(res_y_test, pred_res_gbc,average="macro")*100))
# print("Resampled Gradient Boosting Classifier Precision Score: {:0.2f}%"
#       .format(precision_score(res_y_test, pred_res_gbc, average="macro")*100))
# print("Resampled Gradient Boosting Classifier Recall Score: {:0.2f}%"
#       .format(recall_score(res_y_test, pred_res_gbc, average="macro")*100))
# print("Resampled Gradient Boosting Classifier Cross Validation Score: {:0.2f}%"
#       .format(np.mean(cross_val_score(res_gbc, res_X_train, res_y_train, cv=5)*100)))
# print('\n')

# # Creates a confusion matrix
# res_gbc_cm = confusion_matrix(res_y_test,pred_res_gbc)

# # Transform to df for easier plotting
# res_gbc_cm_df = pd.DataFrame(res_gbc_cm,
#                      index = ['Not Serious','Serious'], 
#                      columns = ['Not Serious','Serious'])

# plt.figure(figsize=(15,5))

# sns.heatmap(res_gbc_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
# plt.title('Resampled Gradient Boosting Classifier Accuracy: {0:.2f}%'.format(accuracy_score(res_y_test,
#                                                                             pred_res_gbc)*100),
#           fontsize=15)
# plt.ylabel('Actual\n')
# plt.xlabel('Predicted\n')
# plt.show()
# end_res_gbc = time.time()
# print("\nResampled Gradient Boosting Time: ", end_res_gbc - start_res_gbc)
# print("Resampled Gradient Boosting Time:", end_res_gbc - start_res_gbc)



#place results for Gradient Boosting here (from machine learning notebook) but do NOT re-run.

# Resampled Gradient Boosting Classifier Accuracy Score: 58.26%
# Resampled Gradient Boosting Classifier F1 Score: 48.58%
# Resampled Gradient Boosting Classifier Precision Score: 54.15%
# Resampled Gradient Boosting Classifier Recall Score: 59.65%
# Resampled Gradient Boosting Classifier Cross Validation Score: 61.43%

# Resampled Gradient Boosting Time:  67961.71300411224

# Confusion Matrix: 
# [[71301,52009],
# [6540,10434]]

LightGBM Classifier with Resampling

In [18]:
#Light GBM

start_res_lgbm = time.time()
res_lgbm = lgb.LGBMClassifier(learning_rate =0.03, max_depth=40, min_data_in_leaf=10,
                           n_estimators=500, num_leaves=50, random_state = 42)
res_lgbm.fit(res_X_train, res_y_train)
pred_res_lgbm = res_lgbm.predict(res_X_test)

#check cv
print("Resampled LightGBM Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(res_lgbm, res_X_train, res_y_train, cv=5)*100)))
print('\n')


res_lgbm_cm = confusion_matrix(res_y_test, pred_res_lgbm)
# Transform to df for easier plotting
res_lgbm_cm_df = pd.DataFrame(res_lgbm_cm,
                     index = ['Not Serious','Serious'], 
                     columns = ['Not Serious','Serious'])

plt.figure(figsize=(15,5))

sns.heatmap(res_lgbm_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled LightGBM Accuracy: {0:.2f}%'.format(accuracy_score(res_y_test,
                                                                            pred_res_lgbm)*100),
          fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
end_res_lgbm = time.time()
print("\nResampled LightGBM Time: ", end_res_lgbm - start_res_lgbm)
Resampled LightGBM Classifier Cross Validation Score: 68.32%


Resampled LightGBM Time:  61.45835494995117
In [19]:
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_lgbm).ravel()

accuracy = accuracy_score(res_y_test,pred_res_lgbm)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy

print("Resampled LightGBM Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled LightGBM Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled LightGBM Classifier Error Rate Score: {0:.2f}%".format(ers))
#check accuracy
print("Resampled LightGBM Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(res_y_test,pred_res_lgbm )*100))
print("Resampled LightGBM Classifier F1 Score: {:0.2f}%"
      .format(f1_score(res_y_test, pred_res_lgbm,average="macro")*100))
print("Resampled LightGBM Classifier Precision Score: {:0.2f}%"
      .format(precision_score(res_y_test, pred_res_lgbm, average="macro")*100))
print("Resampled LightGBM Classifier Recall Score: {:0.2f}%"
      .format(recall_score(res_y_test, pred_res_lgbm, average="macro")*100))
print("Resampled LightGBM Classifier Roc Auc Score: {0:.2f}%"
      .format(roc_auc_score(res_y_test, pred_res_lgbm)*100))
Resampled LightGBM Classifier Specificity Score: 67.74%
Resampled LightGBM Classifier False Positive Rate Score: 32.26%
Resampled LightGBM Classifier Error Rate Score: 32.19%
Resampled LightGBM Classifier Accuracy Score: 67.81%
Resampled LightGBM Classifier F1 Score: 56.33%
Resampled LightGBM Classifier Precision Score: 58.27%
Resampled LightGBM Classifier Recall Score: 68.04%
Resampled LightGBM Classifier Roc Auc Score: 68.04%

XGBoost Classifier with Resampling

In [20]:
#XGBoost
start_res_xgb = time.time()
res_xgb = XGBClassifier(learning_rate=0.05, n_estimators=500, subsample= 1,random_state = 42,
                        gamma = 1, max_depth=40)
res_xgb.fit(res_X_train, res_y_train)

pred_res_xgb = res_xgb.predict(res_X_test)

#check accuracy
print("Resampled XGBoost Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(res_xgb, res_X_train, res_y_train, cv=3)*100)))
print('\n')
# Transform to df for easier plotting of confusion matrix
res_xgb_cm = confusion_matrix(res_y_test, pred_res_xgb)
res_xgb_cm_df = pd.DataFrame(res_xgb_cm,
                     index = ['Not Serious','Serious'], 
                     columns = ['Not Serious','Serious'])

plt.figure(figsize=(15,5))

sns.heatmap(res_xgb_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled XGBoost Accuracy: {0:.2f}%'.format(accuracy_score(res_y_test,
                                                                            pred_res_xgb)*100),
          fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()

end_res_xgb = time.time()
print("Resampled XGBoost Time:", end_res_xgb - start_res_xgb)
Resampled XGBoost Classifier Cross Validation Score: 68.79%


Resampled XGBoost Time: 4441.273263931274
In [21]:
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_xgb).ravel()

accuracy = accuracy_score(res_y_test,pred_res_xgb)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy

print("Resampled XGBoost Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled XGBoost Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled XGBoost Classifier Error Rate Score: {0:.2f}%".format(ers))
print("Resampled XGBoost Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(res_y_test,pred_res_xgb)*100))
print("Resampled XGBoost Classifier F1 Score: {:0.2f}%"
      .format(f1_score(res_y_test, pred_res_xgb,average="macro")*100))
print("Resampled XGBoost Classifier Precision Score: {:0.2f}%"
      .format(precision_score(res_y_test, pred_res_xgb, average="macro")*100))
print("Resampled XGBoost Classifier Recall Score: {:0.2f}%"
      .format(recall_score(res_y_test, pred_res_xgb, average="macro")*100))
print("Resampled XGBoost Classifier Roc Auc Score: {0:.2f}%"
      .format(roc_auc_score(res_y_test, pred_res_xgb)*100))
Resampled XGBoost Classifier Specificity Score: 66.38%
Resampled XGBoost Classifier False Positive Rate Score: 33.62%
Resampled XGBoost Classifier Error Rate Score: 33.20%
Resampled XGBoost Classifier Accuracy Score: 66.80%
Resampled XGBoost Classifier F1 Score: 55.79%
Resampled XGBoost Classifier Precision Score: 58.17%
Resampled XGBoost Classifier Recall Score: 68.10%
Resampled XGBoost Classifier Roc Auc Score: 68.10%

For the following "Balanced" algorithms from imblearn we will be using the standard testing and training sets (X_train, X_test, y_train, y_test) and will allow the algorithms to do the resampling.

For the sampling_strategy, we will be using "majority" as the solution.

'majority': resample only the majority class

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

Balanced Bagging Classifier

In [43]:
#start
start_res_bbag = time.time()

# Balanced Bagging Classifier
res_bbag = BalancedBaggingClassifier(max_features=X.shape[1], n_estimators=500, replacement=True, 
                                     sampling_strategy='majority', random_state=42)

res_bbag.fit(X_train, y_train)
pred_res_bbag = res_bbag.predict(X_test)


# Creates a confusion matrix
res_bbag_cm = confusion_matrix(y_test,pred_res_bbag)

# Transform to df for easier plotting
res_bbag_cm_df = pd.DataFrame(res_bbag_cm,
                     index = ['Not Serious','Serious'], 
                     columns = ['Not Serious','Serious'])

plt.figure(figsize=(15,5))

sns.heatmap(res_bbag_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Balanced Bagging Accuracy: {0:.2f}%'.format(accuracy_score(y_test,pred_res_bbag )*100),
          fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
print("Resampled Balanced Bagging Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(res_bbag, X_train, y_train, cv=5)*100)))
print('\n')
#end
end_res_bbag = time.time()
print("\nResampled Balanced Bagging Time: ",end_res_bbag - start_res_bbag)
Resampled Balanced Bagging Classifier Cross Validation Score: 78.47%



Resampled Balanced Bagging Time:  12142.180311203003
In [44]:
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(y_test,pred_res_bbag).ravel()

accuracy = accuracy_score(y_test,pred_res_bbag)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy

print("Resampled Balanced Bagging Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Balanced Bagging Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Balanced Bagging Classifier Error Rate Score: {0:.2f}%".format(ers))

#Check scores
print("Resampled Balanced Bagging Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_res_bbag )*100))
print("Resampled Balanced Bagging Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_res_bbag,average="macro")*100))
print("Resampled Balanced Bagging Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_res_bbag, average="macro")*100))
print("Resampled Balanced Bagging Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_res_bbag, average="macro")*100))
print("Resampled Balanced Bagging  Classifier Roc Auc Score: {0:.2f}%"
      .format(roc_auc_score(y_test, pred_res_bbag)*100))
Resampled Balanced Bagging Classifier Specificity Score: 82.21%
Resampled Balanced Bagging Classifier False Positive Rate Score: 17.79%
Resampled Balanced Bagging Classifier Error Rate Score: 21.47%
Resampled Balanced Bagging Classifier Accuracy Score: 78.53%
Resampled Balanced Bagging Classifier F1 Score: 61.97%
Resampled Balanced Bagging Classifier Precision Score: 60.58%
Resampled Balanced Bagging Classifier Recall Score: 67.01%
Resampled Balanced Bagging  Classifier Roc Auc Score: 67.01%

Resampled Easy Ensemble Classifier (Imblearn's AdaBoost)

In [45]:
#start
start_res_eec = time.time()

#EasyEnsembleClassifier
res_eec = EasyEnsembleClassifier(n_estimators=500, random_state=42, replacement=True, 
                                 sampling_strategy='majority')

res_eec.fit(X_train, y_train)
pred_res_eec = res_eec.predict(X_test)

print("Resampled Balanced Easy Ensemble Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(res_eec, X_train, y_train, cv=5)*100)))
print('\n')
# Creates a confusion matrix
res_eec_cm = confusion_matrix(y_test,pred_res_eec)

# Transform to df for easier plotting
res_eec_cm_df = pd.DataFrame(res_eec_cm,
                     index = ['Not Serious','Serious'], 
                     columns = ['Not Serious','Serious'])

plt.figure(figsize=(15,5))

sns.heatmap(res_eec_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Balanced Easy Ensemble Accuracy: {0:.2f}%'.format(accuracy_score(y_test,pred_res_eec )*100),
          fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()

#end
end_res_eec = time.time()
print("\nResampled Balanced Easy Ensemble Time: ",end_res_eec - start_res_eec)
Resampled Balanced Easy Ensemble Classifier Cross Validation Score: 66.83%


Resampled Balanced Easy Ensemble Time:  37473.19004392624
In [46]:
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_eec).ravel()

accuracy = accuracy_score(res_y_test,pred_res_eec)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy

print("Resampled Balanced Easy Ensemble Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Balanced Easy Ensemble Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Balanced Easy Ensemble Classifier Error Rate Score: {0:.2f}%".format(ers))

#Check accuracy
print("Resampled Balanced Easy Ensemble Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_res_eec )*100))
print("Resampled Balanced Easy Ensemble Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_res_eec,average="macro")*100))
print("Resampled Balanced Easy Ensemble Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_res_eec, average="macro")*100))
print("Resampled Balanced Easy Ensemble Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_res_eec, average="macro")*100))
print("Resampled Balanced Easy Ensemble  Classifier Roc Auc Score: {0:.2f}%"
      .format(roc_auc_score(y_test, pred_res_eec)*100))
Resampled Balanced Easy Ensemble Classifier Specificity Score: 66.82%
Resampled Balanced Easy Ensemble Classifier False Positive Rate Score: 33.18%
Resampled Balanced Easy Ensemble Classifier Error Rate Score: 33.39%
Resampled Balanced Easy Ensemble Classifier Accuracy Score: 66.61%
Resampled Balanced Easy Ensemble Classifier F1 Score: 54.96%
Resampled Balanced Easy Ensemble Classifier Precision Score: 57.27%
Resampled Balanced Easy Ensemble Classifier Recall Score: 65.95%
Resampled Balanced Easy Ensemble  Classifier Roc Auc Score: 65.95%

Resampled Balanced Random Forest Classifier

In [47]:
#start
start_res_brfc = time.time()

# Balanced Random Forest Classifier
res_brfc = BalancedRandomForestClassifier(criterion='entropy', max_depth=40,
                                          min_samples_leaf = 1, max_features=X.shape[1], 
                                          sampling_strategy='majority', replacement=True,
                                          min_samples_split=8, n_estimators=500, 
                                          random_state=42)

res_brfc.fit(X_train, y_train)
pred_res_brfc = res_brfc.predict(X_test)

# Creates a confusion matrix
res_brfc_cm = confusion_matrix(y_test,pred_res_brfc)

# Transform to df for easier plotting
res_brfc_cm_df = pd.DataFrame(res_brfc_cm,
                     index = ['Not Serious','Serious'], 
                     columns = ['Not Serious','Serious'])

plt.figure(figsize=(15,5))

sns.heatmap(res_brfc_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Balanced Random Forest Accuracy: {0:.2f}%'.format(accuracy_score(y_test,pred_res_brfc )*100),
          fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
print("Resampled Balanced Random Forest Classifier Cross Validation Score: {:0.2f}%"
      .format(np.mean(cross_val_score(res_brfc, X_train, y_train, cv=5)*100)))
print('\n')

#end
end_res_brfc = time.time()
print("\nResampled Balanced Random Forest Time: ",end_res_brfc - start_res_brfc)
Resampled Balanced Random Forest Classifier Cross Validation Score: 67.28%



Resampled Balanced Random Forest Time:  7261.670822143555
In [48]:
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_brfc).ravel()

accuracy = accuracy_score(res_y_test,pred_res_brfc)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy

print("Resampled Balanced Random Forest Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Balanced Random Forest Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Balanced Random Forest Classifier Error Rate Score: {0:.2f}%".format(ers))
#Check accuracy


#Check accuracy
print("Resampled Balanced Random Forest Classifier Accuracy Score: {:0.2f}%"
      .format(accuracy_score(y_test,pred_res_brfc )*100))
print("Resampled Balanced Random Forest Classifier F1 Score: {:0.2f}%"
      .format(f1_score(y_test, pred_res_brfc,average="macro")*100))
print("Resampled Balanced Random Forest Classifier Precision Score: {:0.2f}%"
      .format(precision_score(y_test, pred_res_brfc, average="macro")*100))
print("Resampled Balanced Random Forest Classifier Recall Score: {:0.2f}%"
      .format(recall_score(y_test, pred_res_brfc, average="macro")*100))
print("Resampled Balanced Random Forest Classifier Roc Auc Score: {0:.2f}%"
      .format(roc_auc_score(y_test, pred_res_brfc)*100))
Resampled Balanced Random Forest Classifier Specificity Score: 66.96%
Resampled Balanced Random Forest Classifier False Positive Rate Score: 33.04%
Resampled Balanced Random Forest Classifier Error Rate Score: 32.72%
Resampled Balanced Random Forest Classifier Accuracy Score: 67.28%
Resampled Balanced Random Forest Classifier F1 Score: 56.12%
Resampled Balanced Random Forest Classifier Precision Score: 58.30%
Resampled Balanced Random Forest Classifier Recall Score: 68.28%
Resampled Balanced Random Forest Classifier Roc Auc Score: 68.28%

Machine Learning Results

Below we have compiled a dataframe and visualization of the scores above in order to determine which algorithm would be best for this data.

In [22]:
#create list of results
results_data={'Learning Algorithm':['Bagging','AdaBoost', 'Random Forest', 'LightGBM','XGBoost',
                                    'Balanced Bagging', 'Easy Ensemble', 'Balanced Random Forest'],
              'Accuracy  Score':[66.97,66.74,67.09,67.81,66.8,78.53,66.61,67.28],
              'F1 Score ':[55.81,54.9,55.87,56.33,55.79,61.97,54.96,56.12],
              'Precision Score':[58.1,57.14,58.1,58.27,58.17,60.58,57.27,58.3],
              'Recall Score':[67.88,65.58,67.85,68.04,68.1,67.01,65.95,68.28],
              'Cross Validation Score':[69.11,65.73,69.15,68.32,69.24,78.47,66.83,67.28],
              'Specificity Score':[66.68,67.12,66.84,67.74,66.38,82.21,66.82,66.96], 
              'Error Rate':[33.03,33.26,32.91,32.19,33.2,17.79,33.39,32.72],
              'False Positive Rate':[33.32,32.88,33.16,32.26,33.62,21.47,33.18,33.04],
              'Roc Auc Score':[67.88,65.58,67.85,68.04,68.1,67.01,65.95,68.28],
              'Time':[5531.351397,389.835886,4370.322077,61.45835494995117,4441.273263931274,
                      12142.18031,37473.19004,7261.670822],
              'Learning Library':['Sklearn', 'Sklearn', 'Sklearn', 'LightGBM', 'XGBoost',
                                  'Imblearn', 'Imblearn', 'Imblearn']}
#create dataframe
results=pd.DataFrame(results_data) 

results.head(10)
Out[22]:
Learning Algorithm Accuracy Score F1 Score Precision Score Recall Score Cross Validation Score Specificity Score Error Rate False Positive Rate Roc Auc Score Time Learning Library
0 Bagging 66.97 55.81 58.10 67.88 69.11 66.68 33.03 33.32 67.88 5531.351397 Sklearn
1 AdaBoost 66.74 54.90 57.14 65.58 65.73 67.12 33.26 32.88 65.58 389.835886 Sklearn
2 Random Forest 67.09 55.87 58.10 67.85 69.15 66.84 32.91 33.16 67.85 4370.322077 Sklearn
3 LightGBM 67.81 56.33 58.27 68.04 68.32 67.74 32.19 32.26 68.04 61.458355 LightGBM
4 XGBoost 66.80 55.79 58.17 68.10 69.24 66.38 33.20 33.62 68.10 4441.273264 XGBoost
5 Balanced Bagging 78.53 61.97 60.58 67.01 78.47 82.21 17.79 21.47 67.01 12142.180310 Imblearn
6 Easy Ensemble 66.61 54.96 57.27 65.95 66.83 66.82 33.39 33.18 65.95 37473.190040 Imblearn
7 Balanced Random Forest 67.28 56.12 58.30 68.28 67.28 66.96 32.72 33.04 68.28 7261.670822 Imblearn
In [23]:
#change time to minutes
results['Time in Minutes'] = round(results['Time']/60, 2)

#drop actual Time column
results=results.drop('Time',axis=1)

#rearrange columns
results = results[['Learning Algorithm', 'Accuracy  Score', 'F1 Score ', 'Precision Score',
                   'Recall Score', 'Cross Validation Score', 'Specificity Score', 'Error Rate',
                   'False Positive Rate','Roc Auc Score','Time in Minutes', 'Learning Library']]
results.set_index('Learning Algorithm', inplace=True)
results.head(10)
Out[23]:
Accuracy Score F1 Score Precision Score Recall Score Cross Validation Score Specificity Score Error Rate False Positive Rate Roc Auc Score Time in Minutes Learning Library
Learning Algorithm
Bagging 66.97 55.81 58.10 67.88 69.11 66.68 33.03 33.32 67.88 92.19 Sklearn
AdaBoost 66.74 54.90 57.14 65.58 65.73 67.12 33.26 32.88 65.58 6.50 Sklearn
Random Forest 67.09 55.87 58.10 67.85 69.15 66.84 32.91 33.16 67.85 72.84 Sklearn
LightGBM 67.81 56.33 58.27 68.04 68.32 67.74 32.19 32.26 68.04 1.02 LightGBM
XGBoost 66.80 55.79 58.17 68.10 69.24 66.38 33.20 33.62 68.10 74.02 XGBoost
Balanced Bagging 78.53 61.97 60.58 67.01 78.47 82.21 17.79 21.47 67.01 202.37 Imblearn
Easy Ensemble 66.61 54.96 57.27 65.95 66.83 66.82 33.39 33.18 65.95 624.55 Imblearn
Balanced Random Forest 67.28 56.12 58.30 68.28 67.28 66.96 32.72 33.04 68.28 121.03 Imblearn
In [24]:
#csv file for Tableau 
results.to_csv('learning_results.csv') 
Scores
In [24]:
%%HTML
<div class='tableauPlaceholder' id='viz1572177218898' style='position: relative'><noscript><a href='https:&#47;&#47;github.com&#47;GenTaylor&#47;Traffic-Accident-Analysis'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Le&#47;LearningAlgorithmResults&#47;LearningAlgorithmsScores&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='LearningAlgorithmResults&#47;LearningAlgorithmsScores' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Le&#47;LearningAlgorithmResults&#47;LearningAlgorithmsScores&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1572177218898');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
Rates
In [25]:
%%HTML
<div class='tableauPlaceholder' id='viz1572079997269' style='position: relative'><noscript><a href='https:&#47;&#47;github.com&#47;GenTaylor&#47;Traffic-Accident-Analysis'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Le&#47;LearningAlgorithmResults&#47;LearningAlgorithmsRates&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='LearningAlgorithmResults&#47;LearningAlgorithmsRates' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Le&#47;LearningAlgorithmResults&#47;LearningAlgorithmsRates&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='useGuest' value='true' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1572079997269');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
Timing
In [26]:
%%HTML
<div class='tableauPlaceholder' id='viz1572080028730' style='position: relative'><noscript><a href='https:&#47;&#47;github.com&#47;GenTaylor&#47;Traffic-Accident-Analysis'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Le&#47;LearningAlgorithmResults&#47;LearningAlgorithmsTime&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='LearningAlgorithmResults&#47;LearningAlgorithmsTime' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Le&#47;LearningAlgorithmResults&#47;LearningAlgorithmsTime&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='useGuest' value='true' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1572080028730');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
Choice

Based on the visualizations above, Balanced Bagging Classifier from imblearn is the algorithm of choice for this data. While some of the scores may have been close, Balanced Bagging Classifier had higher scores in Accuracy, Cross Validation, and Specificity. The algorithm also had the lower Error Rate and False Positive Rates of the group. It’s prediction of Serious accidents was close to being overall inaccurate but in the end, I was comfortable with the findings.